Fundamentals of Data Analysis - Final Project

David Durán Prieto Gerardo Adrián Aguirre Vivar Ana Jiménez Santamaría


1. Context

The chosen data set comes from a survey conducted by the PSA (Philippine Statistics Authority) that records household income and expenditure in the Philippines. It contains 41,544 observations and 60 variables, which have been grouped into the following categories:

  • Expenditures
  • Household demographics
  • Demographics of the household head (the main decision-maker)
  • House structure
  • Number of goods owned

For several years, identifying an optimal socio-economic classification model for the Philippines has been a difficult problem. To date, no model has been universally accepted, and the various government agencies each use their own. This work therefore sets one goal: to design a model that addresses the problem effectively.

2. Objective - Question - Target

  • Objective: Predict the income of a Filipino household from the available data.
  • Question: Using a multiple linear regression model, which variables are best suited to predicting income?
  • Target: The response variable is the total income of each Filipino household (Total.Household.Income).

3. Procedure

The analysis will be divided into two phases:

  • The first phase is an exploratory analysis of the data, to better understand the meaning and relevance of each variable. Key points such as the correlation between the variable of interest and the others will be studied. For each variable examined, two questions will be asked:

    • Will this variable be considered when predicting household income? That is, will it be part of the model?
    • If so, how relevant is this variable in determining the response variable?
  • The second phase is the construction of a multiple linear regression model with the selected predictor variables.
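
The two phases above can be sketched on toy data. This is only an illustration under stated assumptions: the data frame `toy`, its column names, the simulated coefficients, and the 0.3 correlation threshold are all hypothetical and do not come from the survey or from the model built later in this report.

```r
# Toy data standing in for the survey; names and values are hypothetical.
set.seed(1)
n <- 200
toy <- data.frame(
  food_exp    = runif(n, 20000, 150000),
  family_size = sample(1:10, n, replace = TRUE),
  noise       = rnorm(n)
)
toy$income <- 50000 + 2 * toy$food_exp + 20000 * toy$family_size +
  rnorm(n, sd = 20000)

# Phase 1: correlation of each numeric candidate with the response.
predictors <- setdiff(names(toy), "income")
cors <- sapply(toy[predictors], function(x) cor(x, toy$income))
print(sort(abs(cors), decreasing = TRUE))

# Keep predictors whose absolute correlation exceeds an arbitrary threshold.
selected <- names(cors)[abs(cors) > 0.3]

# Phase 2: multiple linear regression on the selected predictors.
model <- lm(reformulate(selected, response = "income"), data = toy)
print(summary(model)$coefficients)
```

A correlation screen like this only captures linear, pairwise association, so it is best read as a first filter rather than a final variable selection.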


4. Exploratory data analysis

4.1. Initial processing

Before visualizing the variables (to get a view of how the data are distributed), the data set will be preprocessed and cleaned. Values that should be treated as missing will be labeled NA; certain variables will be removed because they are of no interest for the stated objective; and finally, the train and test/validation sets will be prepared. The train set will be used to fit the prediction model, which will then be evaluated on the test/validation set.
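
The three preprocessing steps just described might look roughly like the following. This is a minimal sketch, not the report's actual pipeline: the data frame `toy`, its columns, the choice of "Unknown" as the label to recode, and the 80/20 split ratio are assumptions made for illustration. Base R `sample()` is used to keep the example dependency-free; the `caret` package loaded below also offers `createDataPartition()` for stratified splits.

```r
# Hypothetical mini data set; column names mimic the survey, values are made up.
set.seed(42)
toy <- data.frame(
  income         = round(runif(10, 50000, 500000)),
  marital_status = c("Married", "Unknown", "Single", "Married", "Widowed",
                     "Married", "Single", "Married", "Unknown", "Married"),
  row_id         = 1:10,
  stringsAsFactors = FALSE
)

# 1. Recode placeholder labels as NA ("Unknown" is an assumed example).
toy$marital_status[toy$marital_status == "Unknown"] <- NA

# 2. Drop a variable that carries no information for the prediction task.
toy$row_id <- NULL

# 3. Random 80/20 split into train and test/validation sets (assumed ratio).
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

print(dim(train))
print(dim(test))
```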

# ----- Load the required libraries -----

library(dplyr)
library(tidyr)
library(ggplot2)
library(forcats)
library(GGally)
library(gridExtra)
library(egg)
library(VIM)
library(vcd)
library(Hmisc)
library(readr)
library(moments)
library(caret)
library(gmodels)
library(reshape)
library(ggcorrplot)

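Next, the main summary statistics of the variables are computed. summary() reports the quartiles and mean of each numeric variable and the level counts of each categorical one; describe() adds the number of observations, the number of missing values, and a dispersion measure (Gini's mean difference, Gmd). Notably, missing values appear only in categorical variables: Household.Head.Occupation and Household.Head.Class.of.Worker each have 7536 NA's, the same count as the households whose head has no job or business. These are handled later.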

# ----- Load data -----

datos <- read.csv('Family_Income_and_Expenditure.csv', stringsAsFactors = TRUE)
# Copy of the loaded data kept under a separate name
datos_occupation <- datos

# ----- Numeric summary of the variables -----

summary(datos)
##  Total.Household.Income                   Region      Total.Food.Expenditure
##  Min.   :   11285       IVA - CALABARZON     : 4162   Min.   :  2947        
##  1st Qu.:  104895       NCR                  : 4130   1st Qu.: 51017        
##  Median :  164080       III - Central Luzon  : 3237   Median : 72986        
##  Mean   :  247556       VI - Western Visayas : 2851   Mean   : 85099        
##  3rd Qu.:  291138       VII - Central Visayas: 2541   3rd Qu.:105636        
##  Max.   :11815988       V - Bicol Region     : 2472   Max.   :827565        
##                         (Other)              :22151                         
##                 Main.Source.of.Income Agricultural.Household.indicator
##  Enterpreneurial Activities:10320     Min.   :0.0000                  
##  Other sources of Income   :10836     1st Qu.:0.0000                  
##  Wage/Salaries             :20388     Median :0.0000                  
##                                       Mean   :0.4299                  
##                                       3rd Qu.:1.0000                  
##                                       Max.   :2.0000                  
##                                                                       
##  Bread.and.Cereals.Expenditure Total.Rice.Expenditure Meat.Expenditure
##  Min.   :     0                Min.   :     0         Min.   :     0  
##  1st Qu.: 16556                1st Qu.: 11020         1st Qu.:  3354  
##  Median : 23324                Median : 16620         Median :  7332  
##  Mean   : 25134                Mean   : 18196         Mean   : 10540  
##  3rd Qu.: 31439                3rd Qu.: 23920         3rd Qu.: 14292  
##  Max.   :765864                Max.   :758326         Max.   :261566  
##                                                                       
##  Total.Fish.and..marine.products.Expenditure Fruit.Expenditure
##  Min.   :     0                              Min.   :     0   
##  1st Qu.:  5504                              1st Qu.:  1025   
##  Median :  8695                              Median :  1820   
##  Mean   : 10529                              Mean   :  2550   
##  3rd Qu.: 13388                              3rd Qu.:  3100   
##  Max.   :188208                              Max.   :273769   
##                                                               
##  Vegetables.Expenditure Restaurant.and.hotels.Expenditure
##  Min.   :    0          Min.   :     0                   
##  1st Qu.: 2873          1st Qu.:  1930                   
##  Median : 4314          Median :  7314                   
##  Mean   : 5007          Mean   : 15437                   
##  3rd Qu.: 6304          3rd Qu.: 19921                   
##  Max.   :74800          Max.   :725296                   
##                                                          
##  Alcoholic.Beverages.Expenditure Tobacco.Expenditure
##  Min.   :    0                   Min.   :     0     
##  1st Qu.:    0                   1st Qu.:     0     
##  Median :  270                   Median :   300     
##  Mean   : 1085                   Mean   :  2295     
##  3rd Qu.: 1299                   3rd Qu.:  3146     
##  Max.   :59592                   Max.   :139370     
##                                                     
##  Clothing..Footwear.and.Other.Wear.Expenditure Housing.and.water.Expenditure
##  Min.   :     0                                Min.   :   1950              
##  1st Qu.:  1365                                1st Qu.:  13080              
##  Median :  2740                                Median :  22992              
##  Mean   :  4955                                Mean   :  38376              
##  3rd Qu.:  5580                                3rd Qu.:  45948              
##  Max.   :356750                                Max.   :2188560              
##                                                                             
##  Imputed.House.Rental.Value Medical.Care.Expenditure Transportation.Expenditure
##  Min.   :      0            Min.   :      0          Min.   :     0            
##  1st Qu.:   6000            1st Qu.:    300          1st Qu.:  2412            
##  Median :  10800            Median :   1125          Median :  6036            
##  Mean   :  20922            Mean   :   7160          Mean   : 11806            
##  3rd Qu.:  24000            3rd Qu.:   4680          3rd Qu.: 13776            
##  Max.   :1920000            Max.   :1049275          Max.   :834996            
##                                                                                
##  Communication.Expenditure Education.Expenditure
##  Min.   :     0            Min.   :     0       
##  1st Qu.:   564            1st Qu.:     0       
##  Median :  1506            Median :   880       
##  Mean   :  4095            Mean   :  7474       
##  3rd Qu.:  3900            3rd Qu.:  4060       
##  Max.   :149940            Max.   :731000       
##                                                 
##  Miscellaneous.Goods.and.Services.Expenditure Special.Occasions.Expenditure
##  Min.   :     0                               Min.   :     0               
##  1st Qu.:  3792                               1st Qu.:     0               
##  Median :  6804                               Median :  1500               
##  Mean   : 12522                               Mean   :  5266               
##  3rd Qu.: 14154                               3rd Qu.:  5000               
##  Max.   :553560                               Max.   :556700               
##                                                                            
##  Crop.Farming.and.Gardening.expenses
##  Min.   :      0                    
##  1st Qu.:      0                    
##  Median :      0                    
##  Mean   :  13817                    
##  3rd Qu.:   6313                    
##  Max.   :3729973                    
##                                     
##  Total.Income.from.Entrepreneurial.Acitivites Household.Head.Sex
##  Min.   :      0                              Female: 9061      
##  1st Qu.:      0                              Male  :32483      
##  Median :  19222                                                
##  Mean   :  54376                                                
##  3rd Qu.:  65969                                                
##  Max.   :9234485                                                
##                                                                 
##  Household.Head.Age    Household.Head.Marital.Status
##  Min.   : 9.00      Annulled          :   11        
##  1st Qu.:41.00      Divorced/Separated: 1425        
##  Median :51.00      Married           :31347        
##  Mean   :51.38      Single            : 1942        
##  3rd Qu.:61.00      Unknown           :    1        
##  Max.   :99.00      Widowed           : 6818        
##                                                     
##      Household.Head.Highest.Grade.Completed
##  High School Graduate   : 9628             
##  Elementary Graduate    : 7640             
##  Grade 4                : 2282             
##  Grade 5                : 2123             
##  Second Year High School: 2104             
##  Grade 3                : 1994             
##  (Other)                :15773             
##  Household.Head.Job.or.Business.Indicator
##  No Job/Business  : 7536                 
##  With Job/Business:34008                 
##                                          
##                                          
##                                          
##                                          
##                                          
##                                                                        Household.Head.Occupation
##  Farmhands and laborers                                                             : 3478      
##  Rice farmers                                                                       : 2849      
##  General managers/managing proprietors in wholesale and retail trade                : 2028      
##  General managers/managing proprietors in transportation, storage and communications: 1932      
##  Corn farmers                                                                       : 1724      
##  (Other)                                                                            :21997      
##  NA's                                                                               : 7536      
##                                   Household.Head.Class.of.Worker
##  Self-employed wihout any employee               :13766         
##  Worked for private establishment                :13731         
##  Worked for government/government corporation    : 2820         
##  Employer in own family-operated farm or business: 2581         
##  Worked for private household                    :  811         
##  (Other)                                         :  299         
##  NA's                                            : 7536         
##                               Type.of.Household Total.Number.of.Family.members
##  Extended Family                       :12932   Min.   : 1.000                
##  Single Family                         :28445   1st Qu.: 3.000                
##  Two or More Nonrelated Persons/Members:  167   Median : 4.000                
##                                                 Mean   : 4.635                
##                                                 3rd Qu.: 6.000                
##                                                 Max.   :26.000                
##                                                                               
##  Members.with.age.less.than.5.year.old Members.with.age.5...17.years.old
##  Min.   :0.0000                        Min.   :0.000                    
##  1st Qu.:0.0000                        1st Qu.:0.000                    
##  Median :0.0000                        Median :1.000                    
##  Mean   :0.4102                        Mean   :1.363                    
##  3rd Qu.:1.0000                        3rd Qu.:2.000                    
##  Max.   :5.0000                        Max.   :8.000                    
##                                                                         
##  Total.number.of.family.members.employed
##  Min.   :0.000                          
##  1st Qu.:0.000                          
##  Median :1.000                          
##  Mean   :1.273                          
##  3rd Qu.:2.000                          
##  Max.   :8.000                          
##                                         
##                                  Type.of.Building.House
##  Commercial/industrial/agricultural building:   51     
##  Duplex                                     : 1084     
##  Institutional living quarter               :    9     
##  Multi-unit residential                     : 1329     
##  Other building unit (e.g. cave, boat)      :    2     
##  Single house                               :39069     
##                                                        
##                                                                  Type.of.Roof  
##  Light material (cogon,nipa,anahaw)                                    : 5074  
##  Mixed but predominantly light materials                               :  846  
##  Mixed but predominantly salvaged materials                            :   56  
##  Mixed but predominantly strong materials                              : 2002  
##  Not Applicable                                                        :   12  
##  Salvaged/makeshift materials                                          :  212  
##  Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos):33342  
##         Type.of.Walls   House.Floor.Area   House.Age      Number.of.bedrooms
##  Light         : 8267   Min.   :  5.0    Min.   :  0.00   Min.   :0.000     
##  NOt applicable:   12   1st Qu.: 25.0    1st Qu.: 10.00   1st Qu.:1.000     
##  Quite Strong  : 3487   Median : 40.0    Median : 17.00   Median :2.000     
##  Salvaged      :  456   Mean   : 55.6    Mean   : 20.13   Mean   :1.788     
##  Strong        :27739   3rd Qu.: 70.0    3rd Qu.: 26.00   3rd Qu.:2.000     
##  Very Light    : 1583   Max.   :998.0    Max.   :200.00   Max.   :9.000     
##                                                                             
##                                            Tenure.Status  
##  Own or owner-like possession of house and lot    :29541  
##  Own house, rent-free lot with consent of owner   : 6165  
##  Rent house/room including lot                    : 2203  
##  Rent-free house and lot with consent of owner    : 2014  
##  Own house, rent-free lot without consent of owner:  995  
##  Own house, rent lot                              :  425  
##  (Other)                                          :  201  
##                                                       Toilet.Facilities
##  Water-sealed, sewer septic tank, used exclusively by household:29162  
##  Water-sealed, sewer septic tank, shared with other household  : 3694  
##  Water-sealed, other depository, used exclusively by household : 2343  
##  Closed pit                                                    : 2273  
##  None                                                          : 1580  
##  Open pit                                                      : 1189  
##  (Other)                                                       : 1303  
##   Electricity                              Main.Source.of.Water.Supply
##  Min.   :0.0000   Own use, faucet, community water system:16093       
##  1st Qu.:1.0000   Shared, tubed/piped deep well          : 6242       
##  Median :1.0000   Shared, faucet, community water system : 4614       
##  Mean   :0.8908   Own use, tubed/piped deep well         : 4587       
##  3rd Qu.:1.0000   Dug well                               : 3876       
##  Max.   :1.0000   Protected spring, river, stream, etc   : 2657       
##                   (Other)                                : 3475       
##  Number.of.Television Number.of.CD.VCD.DVD Number.of.Component.Stereo.set
##  Min.   :0.0000       Min.   :0.0000       Min.   :0.0000                
##  1st Qu.:0.0000       1st Qu.:0.0000       1st Qu.:0.0000                
##  Median :1.0000       Median :0.0000       Median :0.0000                
##  Mean   :0.8569       Mean   :0.4352       Mean   :0.1621                
##  3rd Qu.:1.0000       3rd Qu.:1.0000       3rd Qu.:0.0000                
##  Max.   :6.0000       Max.   :5.0000       Max.   :5.0000                
##                                                                          
##  Number.of.Refrigerator.Freezer Number.of.Washing.Machine
##  Min.   :0.0000                 Min.   :0.0000           
##  1st Qu.:0.0000                 1st Qu.:0.0000           
##  Median :0.0000                 Median :0.0000           
##  Mean   :0.3942                 Mean   :0.3198           
##  3rd Qu.:1.0000                 3rd Qu.:1.0000           
##  Max.   :5.0000                 Max.   :3.0000           
##                                                          
##  Number.of.Airconditioner Number.of.Car..Jeep..Van
##  Min.   :0.0000           Min.   :0.00000         
##  1st Qu.:0.0000           1st Qu.:0.00000         
##  Median :0.0000           Median :0.00000         
##  Mean   :0.1298           Mean   :0.08121         
##  3rd Qu.:0.0000           3rd Qu.:0.00000         
##  Max.   :5.0000           Max.   :5.00000         
##                                                   
##  Number.of.Landline.wireless.telephones Number.of.Cellular.phone
##  Min.   :0.00000                        Min.   : 0.000          
##  1st Qu.:0.00000                        1st Qu.: 1.000          
##  Median :0.00000                        Median : 2.000          
##  Mean   :0.06061                        Mean   : 1.906          
##  3rd Qu.:0.00000                        3rd Qu.: 3.000          
##  Max.   :4.00000                        Max.   :10.000          
##                                                                 
##  Number.of.Personal.Computer Number.of.Stove.with.Oven.Gas.Range
##  Min.   :0.000               Min.   :0.000                      
##  1st Qu.:0.000               1st Qu.:0.000                      
##  Median :0.000               Median :0.000                      
##  Mean   :0.315               Mean   :0.135                      
##  3rd Qu.:0.000               3rd Qu.:0.000                      
##  Max.   :6.000               Max.   :3.000                      
##                                                                 
##  Number.of.Motorized.Banca Number.of.Motorcycle.Tricycle
##  Min.   :0.00000           Min.   :0.0000               
##  1st Qu.:0.00000           1st Qu.:0.0000               
##  Median :0.00000           Median :0.0000               
##  Mean   :0.01312           Mean   :0.2899               
##  3rd Qu.:0.00000           3rd Qu.:0.0000               
##  Max.   :3.00000           Max.   :5.0000               
## 
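
In the summary above, NA's show up only for Household.Head.Occupation and Household.Head.Class.of.Worker. A compact way to check this kind of pattern is to count missing values per column and cross-tabulate them against a related indicator. The sketch below uses a hypothetical toy data frame (`toy`, with made-up values and shortened column names), not the survey data itself.

```r
# Hypothetical toy data: occupation is NA exactly when the head has no job.
toy <- data.frame(
  income     = c(120000, 95000, 310000, 80000, 150000),
  job        = factor(c("With Job/Business", "No Job/Business",
                        "With Job/Business", "No Job/Business",
                        "With Job/Business")),
  occupation = factor(c("Rice farmers", NA, "Corn farmers", NA,
                        "Rice farmers"))
)

# Missing values per column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Cross-tabulate missing occupation against the job indicator.
na_by_job <- table(job = toy$job, occupation_missing = is.na(toy$occupation))
print(na_by_job)
```

If every missing occupation falls in the "No Job/Business" row, the NA's are structural (the question does not apply) rather than non-response, which affects how they should be treated.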
# ----- Missing data in the dataset -----

describe(datos)
## datos 
## 
##  60  Variables      41544  Observations
## --------------------------------------------------------------------------------
## Total.Household.Income 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0    38670        1   247556   219756    56072    71596 
##      .25      .50      .75      .90      .95 
##   104895   164080   291138   502021   692298 
## 
## lowest :    11285    11988    12039    12141    12911
## highest:  6452314  7082152  9952913 11639365 11815988
## --------------------------------------------------------------------------------
## Region 
##        n  missing distinct 
##    41544        0       17 
## 
## lowest :  ARMM                  CAR                    Caraga                 I - Ilocos Region      II - Cagayan Valley   
## highest: VII - Central Visayas  VIII - Eastern Visayas X - Northern Mindanao  XI - Davao Region      XII - SOCCSKSARGEN    
## --------------------------------------------------------------------------------
## Total.Food.Expenditure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0    35776        1    85099    52059    27956    35654 
##      .25      .50      .75      .90      .95 
##    51017    72986   105636   148255   181991 
## 
## lowest :   2947   3704   5408   5482   5638, highest: 691917 720007 729606 791848 827565
## --------------------------------------------------------------------------------
## Main.Source.of.Income 
##        n  missing distinct 
##    41544        0        3 
##                                                                 
## Value      Enterpreneurial Activities    Other sources of Income
## Frequency                       10320                      10836
## Proportion                      0.248                      0.261
##                                      
## Value                   Wage/Salaries
## Frequency                       20388
## Proportion                      0.491
## --------------------------------------------------------------------------------
## Agricultural.Household.indicator 
##        n  missing distinct     Info     Mean      Gmd 
##    41544        0        3    0.679   0.4299   0.6278 
##                             
## Value          0     1     2
## Frequency  28106  9018  4420
## Proportion 0.677 0.217 0.106
## --------------------------------------------------------------------------------
## Bread.and.Cereals.Expenditure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0    26082        1    25134    13311     8552    11487 
##      .25      .50      .75      .90      .95 
##    16556    23324    31439    40385    46887 
## 
## lowest :      0     25     31     32     42, highest: 270612 338818 345643 437467 765864
## --------------------------------------------------------------------------------
## Total.Rice.Expenditure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0    16145        1    18196    11582     3237     6188 
##      .25      .50      .75      .90      .95 
##    11020    16620    23920    31481    36940 
## 
## lowest :      0      1      2      8     10, highest: 189906 206702 343907 429640 758326
## --------------------------------------------------------------------------------
## Meat.Expenditure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0    18619        1    10540    10234      890     1510 
##      .25      .50      .75      .90      .95 
##     3354     7332    14292    23697    30951 
## 
## lowest :      0     16     18     22     25, highest: 114504 119230 132142 140992 261566
## --------------------------------------------------------------------------------
## Total.Fish.and..marine.products.Expenditure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0    18014        1    10529     7604     2438     3461 
##      .25      .50      .75      .90      .95 
##     5504     8695    13388    19431    24490 
## 
## lowest :      0     10     26     36     40, highest:  98288 113749 119640 125802 188208
## --------------------------------------------------------------------------------
## Fruit.Expenditure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0     7140        1     2550     2308      390      583 
##      .25      .50      .75      .90      .95 
##     1025     1820     3100     5190     7120 
## 
## lowest :      0      4      5     10     12, highest:  47042  48980  69319  82600 273769
## --------------------------------------------------------------------------------
## Vegetables.Expenditure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0    10599        1     5007     3296     1330     1861 
##      .25      .50      .75      .90      .95 
##     2873     4314     6304     8854    10886 
## 
## lowest :     0     6    25    30    33, highest: 49000 49810 52401 55230 74800
## --------------------------------------------------------------------------------
## Restaurant.and.hotels.Expenditure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0    12367    0.999    15437    19509        0      120 
##      .25      .50      .75      .90      .95 
##     1930     7314    19921    39629    57064 
## 
## lowest :      0      1      3      4     10, highest: 519820 523230 597150 625200 725296
## --------------------------------------------------------------------------------
## Alcoholic.Beverages.Expenditure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0     4084    0.934     1085     1610        0        0 
##      .25      .50      .75      .90      .95 
##        0      270     1299     3000     4602 
## 
## lowest :     0     5     9    10    12, highest: 44400 44704 46950 51688 59592
## --------------------------------------------------------------------------------
## Tobacco.Expenditure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0     3118    0.897     2295     3396        0        0 
##      .25      .50      .75      .90      .95 
##        0      300     3146     7240    10498 
## 
## lowest :      0      2      3      4      5, highest:  56380  61359  73881  97740 139370
## --------------------------------------------------------------------------------
## Clothing..Footwear.and.Other.Wear.Expenditure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0     9819        1     4955     5561      350      650 
##      .25      .50      .75      .90      .95 
##     1365     2740     5580    11126    16806 
## 
## lowest :      0     12     20     25     30, highest: 174242 191756 212925 217500 356750
## --------------------------------------------------------------------------------
## Housing.and.water.Expenditure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0    13243        1    38375    38124     7020     8832 
##      .25      .50      .75      .90      .95 
##    13080    22992    45948    80520   114210 
## 
## lowest :    1950    1980    2100    2112    2118
## highest: 1403310 1458300 1468476 1663812 2188560
## --------------------------------------------------------------------------------
## Imputed.House.Rental.Value 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0      266    0.998    20922    24331     1200     3000 
##      .25      .50      .75      .90      .95 
##     6000    10800    24000    48000    66000 
## 
## lowest :       0     600     720     900     960
## highest: 1020000 1080000 1200000 1500000 1920000
## --------------------------------------------------------------------------------
## Medical.Care.Expenditure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0    11887        1     7160    11638       30       92 
##      .25      .50      .75      .90      .95 
##      300     1125     4680    15287    30005 
## 
## lowest :       0       5       6       7       8
## highest:  767726  900279  973700 1038512 1049275
## --------------------------------------------------------------------------------
## Transportation.Expenditure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0     7435        1    11806    14087      600     1026 
##      .25      .50      .75      .90      .95 
##     2412     6036    13776    27492    41026 
## 
## lowest :      0     12     18     24     30, highest: 481098 530322 539004 601890 834996
## --------------------------------------------------------------------------------
## Communication.Expenditure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0     3826    0.999     4095     5584        0        0 
##      .25      .50      .75      .90      .95 
##      564     1506     3900    11280    18720 
## 
## lowest :      0     12     18     24     30, highest: 101982 110160 111360 112500 149940
## --------------------------------------------------------------------------------
## Education.Expenditure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0     6893    0.974     7474    12380        0        0 
##      .25      .50      .75      .90      .95 
##        0      880     4060    21350    38750 
## 
## lowest :      0      5     10     12     15, highest: 498178 502600 669400 700000 731000
## --------------------------------------------------------------------------------
## Miscellaneous.Goods.and.Services.Expenditure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0     7669        1    12522    13527     1566     2232 
##      .25      .50      .75      .90      .95 
##     3792     6804    14154    28816    41795 
## 
## lowest :      0     18     60     78     90, highest: 365484 368628 437424 447318 553560
## --------------------------------------------------------------------------------
## Special.Occasions.Expenditure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0     3412    0.968     5266     7949        0        0 
##      .25      .50      .75      .90      .95 
##        0     1500     5000    12750    21697 
## 
## lowest :      0      4      8     10     15, highest: 277860 290000 300000 340000 556700
## --------------------------------------------------------------------------------
## Crop.Farming.and.Gardening.expenses 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0     9961    0.644    13817    24027        0        0 
##      .25      .50      .75      .90      .95 
##        0        0     6313    45113    78205 
## 
## lowest :       0      10      20      25      30
## highest: 1331340 1370800 1779690 2823280 3729973
## --------------------------------------------------------------------------------
## Total.Income.from.Entrepreneurial.Acitivites 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0    20204    0.957    54376    78766        0        0 
##      .25      .50      .75      .90      .95 
##        0    19222    65969   126924   191197 
## 
## lowest :       0      16      20      26      45
## highest: 5107451 5749030 5790000 6576302 9234485
## --------------------------------------------------------------------------------
## Household.Head.Sex 
##        n  missing distinct 
##    41544        0        2 
##                         
## Value      Female   Male
## Frequency    9061  32483
## Proportion  0.218  0.782
## --------------------------------------------------------------------------------
## Household.Head.Age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0       89        1    51.38    16.11       29       33 
##      .25      .50      .75      .90      .95 
##       41       51       61       71       76 
## 
## lowest :  9 10 13 14 15, highest: 95 96 97 98 99
## --------------------------------------------------------------------------------
## Household.Head.Marital.Status 
##        n  missing distinct 
##    41544        0        6 
## 
## lowest : Annulled           Divorced/Separated Married            Single             Unknown           
## highest: Divorced/Separated Married            Single             Unknown            Widowed           
##                                                                    
## Value                Annulled Divorced/Separated            Married
## Frequency                  11               1425              31347
## Proportion              0.000              0.034              0.755
##                                                                    
## Value                  Single            Unknown            Widowed
## Frequency                1942                  1               6818
## Proportion              0.047              0.000              0.164
## --------------------------------------------------------------------------------
## Household.Head.Highest.Grade.Completed 
##        n  missing distinct 
##    41544        0       46 
## 
## lowest : Agriculture, Forestry, and Fishery Programs      Architecture and Building Programs               Arts Programs                                    Basic Programs                                   Business and Administration Programs            
## highest: Teacher Training and Education Sciences Programs Third Year College                               Third Year High School                           Transport Services Programs                      Veterinary Programs                             
## --------------------------------------------------------------------------------
## Household.Head.Job.or.Business.Indicator 
##        n  missing distinct 
##    41544        0        2 
##                                               
## Value        No Job/Business With Job/Business
## Frequency               7536             34008
## Proportion             0.181             0.819
## --------------------------------------------------------------------------------
## Household.Head.Occupation 
##        n  missing distinct 
##    34008     7536      378 
## 
## lowest : Accountants and auditors                                             Accounting and bookkeeping clerks                                    Administrative secretaries and related associate professionals       Advertising and public relations managers                            Agricultural or industrial machinery mechanics and fitters          
## highest: Wood products machine operators                                      Wood treaters                                                        Woodworking machine setters and setter-operators                     Word processor and related operators                                 Workers reporting occupations unidentifiable or inadequately defined
## --------------------------------------------------------------------------------
## Household.Head.Class.of.Worker 
##        n  missing distinct 
##    34008     7536        7 
## 
## lowest : Employer in own family-operated farm or business           Self-employed wihout any employee                          Worked for government/government corporation               Worked for private establishment                           Worked for private household                              
## highest: Worked for government/government corporation               Worked for private establishment                           Worked for private household                               Worked with pay in own family-operated farm or business    Worked without pay in own family-operated farm or business
## --------------------------------------------------------------------------------
## Type.of.Household 
##        n  missing distinct 
##    41544        0        3 
## 
## Extended Family (12932, 0.311), Single Family (28445, 0.685), Two or More
## Nonrelated Persons/Members (167, 0.004)
## --------------------------------------------------------------------------------
## Total.Number.of.Family.members 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0       21     0.98    4.635    2.489        1        2 
##      .25      .50      .75      .90      .95 
##        3        4        6        8        9 
## 
## lowest :  1  2  3  4  5, highest: 17 18 19 20 26
## --------------------------------------------------------------------------------
## Members.with.age.less.than.5.year.old 
##        n  missing distinct     Info     Mean      Gmd 
##    41544        0        6    0.658   0.4102   0.6146 
## 
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##                                               
## Value          0     1     2     3     4     5
## Frequency  28705  9317  2933   511    64    14
## Proportion 0.691 0.224 0.071 0.012 0.002 0.000
## --------------------------------------------------------------------------------
## Members.with.age.5...17.years.old 
##        n  missing distinct     Info     Mean      Gmd 
##    41544        0        9     0.93    1.363    1.495 
## 
## lowest : 0 1 2 3 4, highest: 4 5 6 7 8
##                                                                 
## Value          0     1     2     3     4     5     6     7     8
## Frequency  14802 10445  8111  4704  2152   896   318    96    20
## Proportion 0.356 0.251 0.195 0.113 0.052 0.022 0.008 0.002 0.000
## --------------------------------------------------------------------------------
## Total.number.of.family.members.employed 
##        n  missing distinct     Info     Mean      Gmd 
##    41544        0        9    0.917    1.273    1.209 
## 
## lowest : 0 1 2 3 4, highest: 4 5 6 7 8
##                                                                 
## Value          0     1     2     3     4     5     6     7     8
## Frequency  11494 15312  9303  3579  1280   415   116    33    12
## Proportion 0.277 0.369 0.224 0.086 0.031 0.010 0.003 0.001 0.000
## --------------------------------------------------------------------------------
## Type.of.Building.House 
##        n  missing distinct 
##    41544        0        6 
## 
## lowest : Commercial/industrial/agricultural building Duplex                                      Institutional living quarter                Multi-unit residential                      Other building unit (e.g. cave, boat)      
## highest: Duplex                                      Institutional living quarter                Multi-unit residential                      Other building unit (e.g. cave, boat)       Single house                               
## 
## Commercial/industrial/agricultural building (51, 0.001), Duplex (1084, 0.026),
## Institutional living quarter (9, 0.000), Multi-unit residential (1329, 0.032),
## Other building unit (e.g. cave, boat) (2, 0.000), Single house (39069, 0.940)
## --------------------------------------------------------------------------------
## Type.of.Roof 
##        n  missing distinct 
##    41544        0        7 
## 
## lowest : Light material (cogon,nipa,anahaw)                                     Mixed but predominantly light materials                                Mixed but predominantly salvaged materials                             Mixed but predominantly strong materials                               Not Applicable                                                        
## highest: Mixed but predominantly salvaged materials                             Mixed but predominantly strong materials                               Not Applicable                                                         Salvaged/makeshift materials                                           Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos)
## --------------------------------------------------------------------------------
## Type.of.Walls 
##        n  missing distinct 
##    41544        0        6 
## 
## lowest : Light          NOt applicable Quite Strong   Salvaged       Strong        
## highest: NOt applicable Quite Strong   Salvaged       Strong         Very Light    
##                                                                       
## Value               Light NOt applicable   Quite Strong       Salvaged
## Frequency            8267             12           3487            456
## Proportion          0.199          0.000          0.084          0.011
##                                         
## Value              Strong     Very Light
## Frequency           27739           1583
## Proportion          0.668          0.038
## --------------------------------------------------------------------------------
## House.Floor.Area 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0      313    0.999     55.6    46.87       12       16 
##      .25      .50      .75      .90      .95 
##       25       40       70      100      150 
## 
## lowest :   5   6   7   8   9, highest: 820 840 868 900 998
## --------------------------------------------------------------------------------
## House.Age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0      111    0.999    20.13    15.23        2        5 
##      .25      .50      .75      .90      .95 
##       10       17       26       39       47 
## 
## lowest :   0   1   2   3   4, highest: 120 132 135 150 200
## --------------------------------------------------------------------------------
## Number.of.bedrooms 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0       10    0.911    1.788    1.162        0        1 
##      .25      .50      .75      .90      .95 
##        1        2        2        3        4 
## 
## lowest : 0 1 2 3 4, highest: 5 6 7 8 9
##                                                                       
## Value          0     1     2     3     4     5     6     7     8     9
## Frequency   3930 13431 15456  6111  1875   484   169    46    29    13
## Proportion 0.095 0.323 0.372 0.147 0.045 0.012 0.004 0.001 0.001 0.000
## --------------------------------------------------------------------------------
## Tenure.Status 
##        n  missing distinct 
##    41544        0        8 
## 
## lowest : Not Applicable                                    Own house, rent lot                               Own house, rent-free lot with consent of owner    Own house, rent-free lot without consent of owner Own or owner-like possession of house and lot    
## highest: Own house, rent-free lot without consent of owner Own or owner-like possession of house and lot     Rent house/room including lot                     Rent-free house and lot with consent of owner     Rent-free house and lot without consent of owner 
## --------------------------------------------------------------------------------
## Toilet.Facilities 
##        n  missing distinct 
##    41544        0        8 
## 
## lowest : Closed pit                                                     None                                                           Open pit                                                       Others                                                         Water-sealed, other depository, shared with other household   
## highest: Others                                                         Water-sealed, other depository, shared with other household    Water-sealed, other depository, used exclusively by household  Water-sealed, sewer septic tank, shared with other household   Water-sealed, sewer septic tank, used exclusively by household
## --------------------------------------------------------------------------------
## Electricity 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##    41544        0        2    0.292    37008   0.8908   0.1945 
## 
## --------------------------------------------------------------------------------
## Main.Source.of.Water.Supply 
##        n  missing distinct 
##    41544        0       11 
## 
## lowest : Dug well                                Lake, river, rain and others            Others                                  Own use, faucet, community water system Own use, tubed/piped deep well         
## highest: Protected spring, river, stream, etc    Shared, faucet, community water system  Shared, tubed/piped deep well           Tubed/piped shallow well                Unprotected spring, river, stream, etc 
## --------------------------------------------------------------------------------
## Number.of.Television 
##        n  missing distinct     Info     Mean      Gmd 
##    41544        0        7    0.705   0.8569   0.5956 
## 
## lowest : 0 1 2 3 4, highest: 2 3 4 5 6
##                                                     
## Value          0     1     2     3     4     5     6
## Frequency  10717 27089  2955   597   133    42    11
## Proportion 0.258 0.652 0.071 0.014 0.003 0.001 0.000
## --------------------------------------------------------------------------------
## Number.of.CD.VCD.DVD 
##        n  missing distinct     Info     Mean      Gmd 
##    41544        0        6    0.735   0.4352   0.5375 
## 
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##                                               
## Value          0     1     2     3     4     5
## Frequency  24621 15983   752   163    20     5
## Proportion 0.593 0.385 0.018 0.004 0.000 0.000
## --------------------------------------------------------------------------------
## Number.of.Component.Stereo.set 
##        n  missing distinct     Info     Mean      Gmd 
##    41544        0        6    0.396   0.1621   0.2755 
## 
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##                                               
## Value          0     1     2     3     4     5
## Frequency  35058  6284   174    13    10     5
## Proportion 0.844 0.151 0.004 0.000 0.000 0.000
## --------------------------------------------------------------------------------
## Number.of.Refrigerator.Freezer 
##        n  missing distinct     Info     Mean      Gmd 
##    41544        0        6    0.709   0.3942   0.5075 
## 
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##                                               
## Value          0     1     2     3     4     5
## Frequency  25990 14881   569    73    17    14
## Proportion 0.626 0.358 0.014 0.002 0.000 0.000
## --------------------------------------------------------------------------------
## Number.of.Washing.Machine 
##        n  missing distinct     Info     Mean      Gmd 
##    41544        0        4    0.648   0.3198   0.4419 
##                                   
## Value          0     1     2     3
## Frequency  28484 12845   204    11
## Proportion 0.686 0.309 0.005 0.000
## --------------------------------------------------------------------------------
## Number.of.Airconditioner 
##        n  missing distinct     Info     Mean      Gmd 
##    41544        0        6    0.267   0.1298   0.2392 
## 
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##                                               
## Value          0     1     2     3     4     5
## Frequency  37457  3178   622   199    66    22
## Proportion 0.902 0.076 0.015 0.005 0.002 0.001
## --------------------------------------------------------------------------------
## Number.of.Car..Jeep..Van 
##        n  missing distinct     Info     Mean      Gmd 
##    41544        0        6     0.18  0.08122   0.1538 
## 
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##                                               
## Value          0     1     2     3     4     5
## Frequency  38876  2136   413    77    29    13
## Proportion 0.936 0.051 0.010 0.002 0.001 0.000
## --------------------------------------------------------------------------------
## Number.of.Landline.wireless.telephones 
##        n  missing distinct     Info     Mean      Gmd 
##    41544        0        5    0.153  0.06061   0.1154 
## 
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##                                         
## Value          0     1     2     3     4
## Frequency  39302  2070    96    48    28
## Proportion 0.946 0.050 0.002 0.001 0.001
## --------------------------------------------------------------------------------
## Number.of.Cellular.phone 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    41544        0       11    0.949    1.906    1.646        0        0 
##      .25      .50      .75      .90      .95 
##        1        2        3        4        5 
## 
## lowest :  0  1  2  3  4, highest:  6  7  8  9 10
##                                                                             
## Value          0     1     2     3     4     5     6     7     8     9    10
## Frequency   6939 12484 10377  5820  3281  1467   666   242   153    49    66
## Proportion 0.167 0.301 0.250 0.140 0.079 0.035 0.016 0.006 0.004 0.001 0.002
## --------------------------------------------------------------------------------
## Number.of.Personal.Computer 
##        n  missing distinct     Info     Mean      Gmd 
##    41544        0        7    0.497    0.315   0.5339 
## 
## lowest : 0 1 2 3 4, highest: 2 3 4 5 6
##                                                     
## Value          0     1     2     3     4     5     6
## Frequency  32988  5650  1836   667   271   112    20
## Proportion 0.794 0.136 0.044 0.016 0.007 0.003 0.000
## --------------------------------------------------------------------------------
## Number.of.Stove.with.Oven.Gas.Range 
##        n  missing distinct     Info     Mean      Gmd 
##    41544        0        4    0.342    0.135   0.2357 
##                                   
## Value          0     1     2     3
## Frequency  36101  5287   145    11
## Proportion 0.869 0.127 0.003 0.000
## --------------------------------------------------------------------------------
## Number.of.Motorized.Banca 
##        n  missing distinct     Info     Mean      Gmd 
##    41544        0        4    0.035  0.01312  0.02596 
##                                   
## Value          0     1     2     3
## Frequency  41055   444    34    11
## Proportion 0.988 0.011 0.001 0.000
## --------------------------------------------------------------------------------
## Number.of.Motorcycle.Tricycle 
##        n  missing distinct     Info     Mean      Gmd 
##    41544        0        6    0.564   0.2899   0.4552 
## 
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##                                               
## Value          0     1     2     3     4     5
## Frequency  31282  8811  1199   186    54    12
## Proportion 0.753 0.212 0.029 0.004 0.001 0.000
## --------------------------------------------------------------------------------
# ----- Skewness coefficient of each numeric variable -----

nums <- datos %>%
  select_if(is.numeric)
skewness(nums)
##                        Total.Household.Income 
##                                     8.8963098 
##                        Total.Food.Expenditure 
##                                     2.2309606 
##              Agricultural.Household.indicator 
##                                     1.2857076 
##                 Bread.and.Cereals.Expenditure 
##                                     7.0110325 
##                        Total.Rice.Expenditure 
##                                     8.9897100 
##                              Meat.Expenditure 
##                                     2.6044671 
##   Total.Fish.and..marine.products.Expenditure 
##                                     2.8673929 
##                             Fruit.Expenditure 
##                                    21.6962949 
##                        Vegetables.Expenditure 
##                                     2.5142803 
##             Restaurant.and.hotels.Expenditure 
##                                     5.7407231 
##               Alcoholic.Beverages.Expenditure 
##                                     5.9003524 
##                           Tobacco.Expenditure 
##                                     4.1225588 
## Clothing..Footwear.and.Other.Wear.Expenditure 
##                                     8.3542849 
##                 Housing.and.water.Expenditure 
##                                     9.7024646 
##                    Imputed.House.Rental.Value 
##                                    13.5800101 
##                      Medical.Care.Expenditure 
##                                    15.0488827 
##                    Transportation.Expenditure 
##                                     8.5767093 
##                     Communication.Expenditure 
##                                     4.2437244 
##                         Education.Expenditure 
##                                     8.7911722 
##  Miscellaneous.Goods.and.Services.Expenditure 
##                                     6.0453470 
##                 Special.Occasions.Expenditure 
##                                     9.5947743 
##           Crop.Farming.and.Gardening.expenses 
##                                    23.3872787 
##  Total.Income.from.Entrepreneurial.Acitivites 
##                                    19.7165572 
##                            Household.Head.Age 
##                                     0.2369655 
##                Total.Number.of.Family.members 
##                                     0.8668230 
##         Members.with.age.less.than.5.year.old 
##                                     1.7905377 
##             Members.with.age.5...17.years.old 
##                                     1.0535841 
##       Total.number.of.family.members.employed 
##                                     1.0720297 
##                              House.Floor.Area 
##                                     4.3806605 
##                                     House.Age 
##                                     1.3687074 
##                            Number.of.bedrooms 
##                                     0.8778295 
##                                   Electricity 
##                                    -2.5062518 
##                          Number.of.Television 
##                                     1.0914439 
##                          Number.of.CD.VCD.DVD 
##                                     1.0769677 
##                Number.of.Component.Stereo.set 
##                                     2.4742277 
##                Number.of.Refrigerator.Freezer 
##                                     1.1702791 
##                     Number.of.Washing.Machine 
##                                     0.9427084 
##                      Number.of.Airconditioner 
##                                     4.5715571 
##                      Number.of.Car..Jeep..Van 
##                                     5.6337177 
##        Number.of.Landline.wireless.telephones 
##                                     6.0636086 
##                      Number.of.Cellular.phone 
##                                     1.2011583 
##                   Number.of.Personal.Computer 
##                                     3.0470314 
##           Number.of.Stove.with.Oven.Gas.Range 
##                                     2.4572737 
##                     Number.of.Motorized.Banca 
##                                    11.5458788 
##                 Number.of.Motorcycle.Tricycle 
##                                     2.2262507
cat <- datos %>%
  select_if(is.factor)

Because the dataset is sparsely documented, the meaning of some variables (for example, Agricultural.Household.indicator) could not be determined. Variables whose interpretation is unknown are therefore removed.

# ----- Removal of variables from the dataset -----

datos <- datos %>%
  select(-Agricultural.Household.indicator,
         -Members.with.age.less.than.5.year.old,
         -Members.with.age.5...17.years.old,
         -Household.Head.Occupation)

Once those variables have been discarded, every value considered erroneous or unrecorded (missing values) is labelled NA. Such values usually appear as Unknown, Not Applicable, or 0. The last case requires care, however: for some variables 0 is a legitimate value, given the socio-economic nature of the data.
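The recoding described above can be sketched as follows. This is only an illustration on an invented toy data frame, not the code used in the analysis: the helper `sentinel_to_na()` and the object `toy` are hypothetical names, and the choice to match labels case-insensitively is an assumption made so that spellings such as "NOt applicable" are also caught.

```r
# Toy data frame standing in for `datos`; all values are invented.
toy <- data.frame(
  Marital.Status = factor(c("Married", "Unknown", "Single", "Widowed")),
  Type.of.Walls = factor(c("Strong", "Light", "NOt applicable", "Strong")),
  Education.Expenditure = c(0, 880, 4060, 0)
)

# Hypothetical helper: replace sentinel labels with NA and drop unused levels.
# Labels are compared in lower case, so "NOt applicable" is matched as well.
sentinel_to_na <- function(x, sentinels = c("unknown", "not applicable")) {
  x[tolower(as.character(x)) %in% sentinels] <- NA
  droplevels(x)
}

# Apply the helper to factor columns only.
is_fac <- vapply(toy, is.factor, logical(1))
toy[is_fac] <- lapply(toy[is_fac], sentinel_to_na)

# Numeric zeros are deliberately left untouched: a household may
# legitimately report no expenditure in a given category.
colSums(is.na(toy))
```

Restricting the helper to factor columns keeps numeric zeros out of scope, so each numeric variable can be reviewed individually before any 0 is treated as missing.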

In addition, certain variables are categorized, with the levels they may take specified explicitly.
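As a minimal sketch of this kind of categorization, the snippet below builds an ordered factor with an explicit level order. The input vector is invented, the object names `household` and `household_ord` are hypothetical, and the order shown is simply one an analyst might choose.

```r
# Invented values; the labels follow the Type.of.Household levels.
household <- c("Extended Family", "Single Family", "Single Family",
               "Two or More Nonrelated Persons/Members")

# The level order is chosen explicitly. Any label missing from `levels`
# would silently become NA, so the spelling must match the data exactly.
household_ord <- factor(
  household,
  levels = c("Single Family",
             "Two or More Nonrelated Persons/Members",
             "Extended Family"),
  ordered = TRUE
)

levels(household_ord)
table(household_ord, useNA = "ifany")
```

Note that with R's default contrast settings, ordered factors entering `lm()` are coded with polynomial contrasts (`contr.poly`), so the chosen order affects how the resulting coefficients are read.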

# ----- Correction of variable values and categorization -----

levels(datos$Main.Source.of.Income)
## [1] "Enterpreneurial Activities" "Other sources of Income"   
## [3] "Wage/Salaries"
summary(datos$Main.Source.of.Income)
## Enterpreneurial Activities    Other sources of Income 
##                      10320                      10836 
##              Wage/Salaries 
##                      20388
datos$Main.Source.of.Income <- factor(datos$Main.Source.of.Income, ordered = TRUE,
                                      levels = c('Other sources of Income',
                                                 'Enterpreneurial Activities',
                                                 'Wage/Salaries'))
levels(datos$Main.Source.of.Income)
## [1] "Other sources of Income"    "Enterpreneurial Activities"
## [3] "Wage/Salaries"
#--------------------------------------------------

levels(datos$Household.Head.Marital.Status)
## [1] "Annulled"           "Divorced/Separated" "Married"           
## [4] "Single"             "Unknown"            "Widowed"
summary(datos$Household.Head.Marital.Status)
##           Annulled Divorced/Separated            Married             Single 
##                 11               1425              31347               1942 
##            Unknown            Widowed 
##                  1               6818
datos$Household.Head.Marital.Status[which(datos$Household.Head.Marital.Status == 'Unknown')] <- NA # The "Unknown" value is labelled as NA
datos$Household.Head.Marital.Status<-fct_drop(datos$Household.Head.Marital.Status)
levels(datos$Household.Head.Marital.Status)
## [1] "Annulled"           "Divorced/Separated" "Married"           
## [4] "Single"             "Widowed"
datos$Household.Head.Marital.Status = 
  factor(datos$Household.Head.Marital.Status,ordered=TRUE,levels=
           (c('Single'
              ,'Widowed'
              ,'Annulled'
              ,'Divorced/Separated'
              ,'Married')))
levels(datos$Household.Head.Marital.Status)
## [1] "Single"             "Widowed"            "Annulled"          
## [4] "Divorced/Separated" "Married"
#---------------------------------------------------------------------------------------------------------------------------------------------

levels(datos$Household.Head.Class.of.Worker)
## [1] "Employer in own family-operated farm or business"          
## [2] "Self-employed wihout any employee"                         
## [3] "Worked for government/government corporation"              
## [4] "Worked for private establishment"                          
## [5] "Worked for private household"                              
## [6] "Worked with pay in own family-operated farm or business"   
## [7] "Worked without pay in own family-operated farm or business"
summary(datos$Household.Head.Class.of.Worker)
##           Employer in own family-operated farm or business 
##                                                       2581 
##                          Self-employed wihout any employee 
##                                                      13766 
##               Worked for government/government corporation 
##                                                       2820 
##                           Worked for private establishment 
##                                                      13731 
##                               Worked for private household 
##                                                        811 
##    Worked with pay in own family-operated farm or business 
##                                                         14 
## Worked without pay in own family-operated farm or business 
##                                                        285 
##                                                       NA's 
##                                                       7536
datos$Household.Head.Class.of.Worker = 
  factor(datos$Household.Head.Class.of.Worker,ordered=TRUE,levels=
           (c('Worked without pay in own family-operated farm or business'
              ,'Employer in own family-operated farm or business'
              ,'Worked with pay in own family-operated farm or business'
              ,'Self-employed wihout any employee'
              ,'Worked for private household'
              ,'Worked for private establishment'
              ,'Worked for government/government corporation')))
levels(datos$Household.Head.Class.of.Worker)
## [1] "Worked without pay in own family-operated farm or business"
## [2] "Employer in own family-operated farm or business"          
## [3] "Worked with pay in own family-operated farm or business"   
## [4] "Self-employed wihout any employee"                         
## [5] "Worked for private household"                              
## [6] "Worked for private establishment"                          
## [7] "Worked for government/government corporation"
#---------------------------------------------------------------------------------------------------------------------------------------------

levels(datos$Type.of.Household)
## [1] "Extended Family"                       
## [2] "Single Family"                         
## [3] "Two or More Nonrelated Persons/Members"
datos$Type.of.Household = 
  factor(datos$Type.of.Household,ordered=TRUE,levels=
           (c('Single Family'
              ,'Two or More Nonrelated Persons/Members'
              ,'Extended Family')))
levels(datos$Type.of.Household)
## [1] "Single Family"                         
## [2] "Two or More Nonrelated Persons/Members"
## [3] "Extended Family"
#---------------------------------------------------------------------------------------------------------------------------------------------

levels(datos$Type.of.Building.House)
## [1] "Commercial/industrial/agricultural building"
## [2] "Duplex"                                     
## [3] "Institutional living quarter"               
## [4] "Multi-unit residential"                     
## [5] "Other building unit (e.g. cave, boat)"      
## [6] "Single house"
datos$Type.of.Building.House = 
  factor(datos$Type.of.Building.House,ordered=TRUE,levels=
           (c('Other building unit (e.g. cave, boat)'
              ,'Institutional living quarter'
              ,'Commercial/industrial/agricultural building'
              ,'Single house'
              ,'Duplex'
              ,'Multi-unit residential')))
levels(datos$Type.of.Building.House)
## [1] "Other building unit (e.g. cave, boat)"      
## [2] "Institutional living quarter"               
## [3] "Commercial/industrial/agricultural building"
## [4] "Single house"                               
## [5] "Duplex"                                     
## [6] "Multi-unit residential"
#---------------------------------------------------------------------------------------------------------------------------------------------

levels(datos$Type.of.Roof)
## [1] "Light material (cogon,nipa,anahaw)"                                    
## [2] "Mixed but predominantly light materials"                               
## [3] "Mixed but predominantly salvaged materials"                            
## [4] "Mixed but predominantly strong materials"                              
## [5] "Not Applicable"                                                        
## [6] "Salvaged/makeshift materials"                                          
## [7] "Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos)"
summary(datos$Type.of.Roof)
##                                     Light material (cogon,nipa,anahaw) 
##                                                                   5074 
##                                Mixed but predominantly light materials 
##                                                                    846 
##                             Mixed but predominantly salvaged materials 
##                                                                     56 
##                               Mixed but predominantly strong materials 
##                                                                   2002 
##                                                         Not Applicable 
##                                                                     12 
##                                           Salvaged/makeshift materials 
##                                                                    212 
## Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos) 
##                                                                  33342
datos$Type.of.Roof[which(datos$Type.of.Roof=='Not Applicable')] <-NA # The value "Not Applicable" is recoded as NA
datos$Type.of.Roof<-fct_drop(datos$Type.of.Roof)
levels(datos$Type.of.Roof)
## [1] "Light material (cogon,nipa,anahaw)"                                    
## [2] "Mixed but predominantly light materials"                               
## [3] "Mixed but predominantly salvaged materials"                            
## [4] "Mixed but predominantly strong materials"                              
## [5] "Salvaged/makeshift materials"                                          
## [6] "Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos)"
summary(datos$Type.of.Roof)
##                                     Light material (cogon,nipa,anahaw) 
##                                                                   5074 
##                                Mixed but predominantly light materials 
##                                                                    846 
##                             Mixed but predominantly salvaged materials 
##                                                                     56 
##                               Mixed but predominantly strong materials 
##                                                                   2002 
##                                           Salvaged/makeshift materials 
##                                                                    212 
## Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos) 
##                                                                  33342 
##                                                                   NA's 
##                                                                     12
datos$Type.of.Roof = 
  factor(datos$Type.of.Roof,ordered=TRUE,levels=
           (c('Salvaged/makeshift materials'
              ,'Light material (cogon,nipa,anahaw)'
              ,'Mixed but predominantly salvaged materials'
              ,'Mixed but predominantly light materials'
              ,'Mixed but predominantly strong materials'
              ,'Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos)')))
levels(datos$Type.of.Roof)
## [1] "Salvaged/makeshift materials"                                          
## [2] "Light material (cogon,nipa,anahaw)"                                    
## [3] "Mixed but predominantly salvaged materials"                            
## [4] "Mixed but predominantly light materials"                               
## [5] "Mixed but predominantly strong materials"                              
## [6] "Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos)"
#---------------------------------------------------------------------------------------------------------------------------------------------

levels(datos$Type.of.Walls)
## [1] "Light"          "NOt applicable" "Quite Strong"   "Salvaged"      
## [5] "Strong"         "Very Light"
summary(datos$Type.of.Walls)
##          Light NOt applicable   Quite Strong       Salvaged         Strong 
##           8267             12           3487            456          27739 
##     Very Light 
##           1583
datos$Type.of.Walls[which(datos$Type.of.Walls=='NOt applicable')] <-NA # The value "NOt applicable" (spelled this way in the data) is recoded as NA
datos$Type.of.Walls<-fct_drop(datos$Type.of.Walls)
levels(datos$Type.of.Walls)
## [1] "Light"        "Quite Strong" "Salvaged"     "Strong"       "Very Light"
summary(datos$Type.of.Walls)
##        Light Quite Strong     Salvaged       Strong   Very Light         NA's 
##         8267         3487          456        27739         1583           12
datos$Type.of.Walls= 
  factor(datos$Type.of.Walls,ordered=TRUE,levels=
           (c('Salvaged'
              ,'Very Light'
              ,'Light'
              ,'Strong'
              ,'Quite Strong')))
levels(datos$Type.of.Walls)
## [1] "Salvaged"     "Very Light"   "Light"        "Strong"       "Quite Strong"
#---------------------------------------------------------------------------------------------------------------------------------------------
levels(datos$Toilet.Facilities)
## [1] "Closed pit"                                                    
## [2] "None"                                                          
## [3] "Open pit"                                                      
## [4] "Others"                                                        
## [5] "Water-sealed, other depository, shared with other household"   
## [6] "Water-sealed, other depository, used exclusively by household" 
## [7] "Water-sealed, sewer septic tank, shared with other household"  
## [8] "Water-sealed, sewer septic tank, used exclusively by household"
summary(datos$Toilet.Facilities)
##                                                     Closed pit 
##                                                           2273 
##                                                           None 
##                                                           1580 
##                                                       Open pit 
##                                                           1189 
##                                                         Others 
##                                                            353 
##    Water-sealed, other depository, shared with other household 
##                                                            950 
##  Water-sealed, other depository, used exclusively by household 
##                                                           2343 
##   Water-sealed, sewer septic tank, shared with other household 
##                                                           3694 
## Water-sealed, sewer septic tank, used exclusively by household 
##                                                          29162
datos$Toilet.Facilities= 
  factor(datos$Toilet.Facilities,ordered=TRUE,levels=
           (c('None'
              ,'Others'
              ,'Open pit'
              ,'Closed pit'
              ,'Water-sealed, other depository, shared with other household'
              ,'Water-sealed, other depository, used exclusively by household'
              ,'Water-sealed, sewer septic tank, shared with other household'
              ,'Water-sealed, sewer septic tank, used exclusively by household')))
levels(datos$Toilet.Facilities)
## [1] "None"                                                          
## [2] "Others"                                                        
## [3] "Open pit"                                                      
## [4] "Closed pit"                                                    
## [5] "Water-sealed, other depository, shared with other household"   
## [6] "Water-sealed, other depository, used exclusively by household" 
## [7] "Water-sealed, sewer septic tank, shared with other household"  
## [8] "Water-sealed, sewer septic tank, used exclusively by household"
#---------------------------------------------------------------------------------------------------------------------------------------------
levels(datos$Household.Head.Occupation)
## NULL
#---------------------------------------------------------------------------------------------------------------------------------------------
levels(datos$Main.Source.of.Water.Supply)
##  [1] "Dug well"                               
##  [2] "Lake, river, rain and others"           
##  [3] "Others"                                 
##  [4] "Own use, faucet, community water system"
##  [5] "Own use, tubed/piped deep well"         
##  [6] "Peddler"                                
##  [7] "Protected spring, river, stream, etc"   
##  [8] "Shared, faucet, community water system" 
##  [9] "Shared, tubed/piped deep well"          
## [10] "Tubed/piped shallow well"               
## [11] "Unprotected spring, river, stream, etc"
datos$Main.Source.of.Water.Supply= 
  factor(datos$Main.Source.of.Water.Supply,ordered=TRUE,levels=
           (c('Others'
              ,'Dug well'
              ,'Lake, river, rain and others'
              ,'Unprotected spring, river, stream, etc'
              ,'Protected spring, river, stream, etc'
              ,'Tubed/piped shallow well'
              ,'Shared, tubed/piped deep well'
              ,'Own use, tubed/piped deep well'
              ,'Peddler'
              ,'Shared, faucet, community water system'
              ,'Own use, faucet, community water system')))
levels(datos$Main.Source.of.Water.Supply)
##  [1] "Others"                                 
##  [2] "Dug well"                               
##  [3] "Lake, river, rain and others"           
##  [4] "Unprotected spring, river, stream, etc" 
##  [5] "Protected spring, river, stream, etc"   
##  [6] "Tubed/piped shallow well"               
##  [7] "Shared, tubed/piped deep well"          
##  [8] "Own use, tubed/piped deep well"         
##  [9] "Peddler"                                
## [10] "Shared, faucet, community water system" 
## [11] "Own use, faucet, community water system"
#---------------------------------------------------------------------------------------------------------------------------------------------
levels(datos$Tenure.Status)
## [1] "Not Applicable"                                   
## [2] "Own house, rent lot"                              
## [3] "Own house, rent-free lot with consent of owner"   
## [4] "Own house, rent-free lot without consent of owner"
## [5] "Own or owner-like possession of house and lot"    
## [6] "Rent house/room including lot"                    
## [7] "Rent-free house and lot with consent of owner"    
## [8] "Rent-free house and lot without consent of owner"
datos$Tenure.Status[which(datos$Tenure.Status=='Not Applicable')] <-NA # The value "Not Applicable" is recoded as NA
datos$Tenure.Status<-fct_drop(datos$Tenure.Status)
levels(datos$Tenure.Status)
## [1] "Own house, rent lot"                              
## [2] "Own house, rent-free lot with consent of owner"   
## [3] "Own house, rent-free lot without consent of owner"
## [4] "Own or owner-like possession of house and lot"    
## [5] "Rent house/room including lot"                    
## [6] "Rent-free house and lot with consent of owner"    
## [7] "Rent-free house and lot without consent of owner"
summary(datos$Tenure.Status)
##                               Own house, rent lot 
##                                               425 
##    Own house, rent-free lot with consent of owner 
##                                              6165 
## Own house, rent-free lot without consent of owner 
##                                               995 
##     Own or owner-like possession of house and lot 
##                                             29541 
##                     Rent house/room including lot 
##                                              2203 
##     Rent-free house and lot with consent of owner 
##                                              2014 
##  Rent-free house and lot without consent of owner 
##                                               128 
##                                              NA's 
##                                                73
#---------------------------------------------------------------------------------------------------------------------------------------------

levels(datos$Electricity) 
## NULL
summary(datos$Electricity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  1.0000  1.0000  0.8908  1.0000  1.0000
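# Electricity is still numeric at this point, so it is wrapped in factor() to get one bar colour per value
ggplot(datos, aes(x=Number.of.Airconditioner,fill= factor(Electricity))) + geom_bar(position = "dodge")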

The variable “Electricity” is a clear example of why values equal to 0 should not be treated as NA. Since the dataset provides no description of its variables beyond their names, the meaning of those 0s has to be inferred. The plot shows that every household with an air conditioner has a 1 in Electricity, and that no household with a 0 has an air conditioner. It is therefore reasonable to conclude that 1 means the household has electricity and 0 means it does not.
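The same check can be done numerically instead of by reading the plot. The following is a minimal sketch on a small made-up data frame (`toy` and its values are illustrative assumptions, not taken from the survey). It cross-tabulates the 0/1 flag against whether the household owns at least one air conditioner.

```r
# Toy data: illustrative values only, not the survey data
toy <- data.frame(
  Electricity = c(1, 1, 1, 0, 0, 1),
  Number.of.Airconditioner = c(0, 2, 1, 0, 0, 0)
)

# Cross-tabulate the 0/1 flag against "owns at least one air conditioner"
cross <- table(
  Electricity = toy$Electricity,
  Has.Aircon  = toy$Number.of.Airconditioner > 0
)
print(cross)

# If 0 means "no electricity", no household with a 0 should own an air conditioner
no_elec_with_aircon <- sum(toy$Electricity == 0 & toy$Number.of.Airconditioner > 0)
print(no_elec_with_aircon)
```

On the real data, the equivalent count being zero would support the interpretation given above, although it would not strictly prove it.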

For clarity, the variable is recoded as categorical with the values “Si” (yes) and “No”, which replace the ones and zeros respectively.

# Replace 0/1 with No/Si (the first assignment coerces the numeric column to character)

datos$Electricity[which(datos$Electricity=='0')] <- 'No'
datos$Electricity[which(datos$Electricity=='1')] <- 'Si'

# Convert Electricity to a categorical variable, since it is binary (0 or 1 / No or Si)

datos$Electricity<-as.factor(datos$Electricity)

datos$Electricity = 
  factor(datos$Electricity,ordered=TRUE,levels=
           (c('No','Si')))
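As an alternative to the step-by-step recoding above, the mapping and the ordering can be declared in a single `factor()` call. This is only a sketch on a made-up vector (`electricity` is a hypothetical name, not a column of the survey data). It relies on the `levels` and `labels` arguments of base R's `factor()`.

```r
# Toy 0/1 vector: illustrative values only
electricity <- c(1, 0, 1, 1, 0)

# Map 0 -> "No" and 1 -> "Si" and declare the order No < Si in a single call
electricity_f <- factor(electricity, levels = c(0, 1),
                        labels = c("No", "Si"), ordered = TRUE)

print(table(electricity_f))
```

This form avoids the intermediate conversion of the numeric column to character, and any value other than 0 or 1 would become NA rather than a stray level.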

The last group of variables corresponds to the number of goods acquired. These variables are stored as numeric, but their ranges are very narrow compared with the other numeric variables. They are inspected more closely by looking at the summary statistics of each good, in particular the mean per Filipino household.

#---------------------------------------------------------------------------------------------------------------------------------------------

summary(datos$Number.of.bedrooms)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   1.788   2.000   9.000
#---------------------------------------------------------------------------------------------------------------------------------------------

summary(datos$Number.of.Refrigerator.Freezer)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3942  1.0000  5.0000
#---------------------------------------------------------------------------------------------------------------------------------------------

summary(datos$Number.of.Washing.Machine)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3198  1.0000  3.0000
#---------------------------------------------------------------------------------------------------------------------------------------------

summary(datos$Number.of.Airconditioner)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1298  0.0000  5.0000
#---------------------------------------------------------------------------------------------------------------------------------------------

summary(datos$Number.of.Car..Jeep..Van)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08121 0.00000 5.00000
#---------------------------------------------------------------------------------------------------------------------------------------------

summary(datos$Number.of.CD.VCD.DVD)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4352  1.0000  5.0000
#---------------------------------------------------------------------------------------------------------------------------------------------

summary(datos$Number.of.Cellular.phone)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   1.906   3.000  10.000
#---------------------------------------------------------------------------------------------------------------------------------------------

summary(datos$Number.of.Component.Stereo.set)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1621  0.0000  5.0000
#---------------------------------------------------------------------------------------------------------------------------------------------

summary(datos$Number.of.Landline.wireless.telephones)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06061 0.00000 4.00000
#---------------------------------------------------------------------------------------------------------------------------------------------

summary(datos$Number.of.Personal.Computer)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.315   0.000   6.000
#---------------------------------------------------------------------------------------------------------------------------------------------

summary(datos$Number.of.Motorcycle.Tricycle)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2899  0.0000  5.0000
#---------------------------------------------------------------------------------------------------------------------------------------------

summary(datos$Number.of.Stove.with.Oven.Gas.Range)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.135   0.000   3.000
#---------------------------------------------------------------------------------------------------------------------------------------------

summary(datos$Number.of.Television)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.8569  1.0000  6.0000
#---------------------------------------------------------------------------------------------------------------------------------------------

summary(datos$Number.of.Motorized.Banca)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01312 0.00000 3.00000
#---------------------------------------------------------------------------------------------------------------------------------------------
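The per-variable summaries above could also be gathered into a single table. The following is a minimal sketch, assuming the goods columns can be identified by the `Number.of` prefix. The data frame `toy` and its values are made up for illustration.

```r
# Toy data frame: column names follow the dataset's naming, values are made up
toy <- data.frame(
  Number.of.Television = c(1, 0, 2, 1),
  Number.of.Cellular.phone = c(2, 3, 1, 2),
  Total.Household.Income = c(100, 200, 150, 300)
)

# Select the columns whose name starts with "Number.of"
goods <- toy[, grepl("^Number\\.of", names(toy)), drop = FALSE]

# Mean and maximum of each goods variable, one row per variable
goods_summary <- t(sapply(goods, function(x) c(mean = mean(x), max = max(x))))
print(goods_summary)
```

Having one row per variable makes it easier to compare the means and ranges side by side.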

Once the data are ordered, and because the size of the initial sample could make it unwieldy to work with, a simple random sample of 10000 observations is drawn, with a fixed random seed.

The sample of 10000 observations is split into two sets: a train set and a test/validation set (70%-30%). The analysis is carried out on the train set, while the test set is reserved for the final part (model evaluation).

# ----- Draw a sample from the initial dataset by simple random sampling without replacement -----

set.seed(300)
datos_s <- datos %>%
  sample_n(size=10000,replace=FALSE)

# Split the sample of 10000 observations into two sets: train and test (70%-30%)

training <- createDataPartition(pull(datos_s, Total.Household.Income ),
                                p = 0.7, list = FALSE, times = 1)

datos_training <- slice(datos_s, training)
datos_testing <- slice(datos_s, -training)

var_train_cat <- datos_training%>%select_if(is.factor)
var_train_num <- datos_training%>%select_if(is.numeric)
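A quick sanity check on a split like the one above is to confirm that the two sets have the expected sizes and share no rows. The sketch below uses a plain base-R random split on made-up data as a stand-in for `createDataPartition`, so that it runs without `caret`. The names `toy`, `toy_train` and `toy_test` are hypothetical.

```r
set.seed(300)

# Toy sample of 1000 rows: made-up incomes, not the survey data
toy <- data.frame(id = 1:1000, income = rexp(1000, rate = 1 / 200000))

# Plain random 70/30 split (stand-in for createDataPartition)
train_idx <- sample(seq_len(nrow(toy)), size = round(0.7 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The two sets should be disjoint and together cover the whole sample
overlap <- intersect(toy_train$id, toy_test$id)
print(c(train = nrow(toy_train), test = nrow(toy_test), overlap = length(overlap)))
```

Unlike this plain split, `createDataPartition` stratifies on the outcome, so the sizes on the real data may differ slightly from an exact 70/30 split.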

4.2. Analysis of qualitative variables

To study the qualitative variables in more depth, it is useful to look at their absolute frequencies, one by one, using the table() function.
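`table()` returns counts only. Relative frequencies can be derived from those counts with base R's `prop.table()`. The following is a minimal sketch on a made-up factor (`sex` and its values are illustrative, not the survey data).

```r
# Toy factor: illustrative values only
sex <- factor(c("Male", "Female", "Male", "Male"))

abs_freq <- table(sex)                 # absolute frequencies
rel_freq <- prop.table(abs_freq)       # relative frequencies (sum to 1)

print(abs_freq)
print(round(100 * rel_freq, 1))        # as percentages
```

Relative frequencies make it easier to compare variables with very unbalanced levels.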

# ----- Absolute and relative frequencies ------

# Absolute frequencies - table() function (contingency table)

table(var_train_cat$Region)
## 
##                      ARMM                       CAR                    Caraga 
##                       375                       299                       299 
##         I - Ilocos Region       II - Cagayan Valley       III - Central Luzon 
##                       401                       367                       532 
##          IVA - CALABARZON            IVB - MIMAROPA IX - Zasmboanga Peninsula 
##                       658                       193                       319 
##                       NCR          V - Bicol Region      VI - Western Visayas 
##                       724                       412                       497 
##     VII - Central Visayas    VIII - Eastern Visayas     X - Northern Mindanao 
##                       416                       397                       296 
##         XI - Davao Region        XII - SOCCSKSARGEN 
##                       432                       383
table(var_train_cat$Main.Source.of.Income)
## 
##    Other sources of Income Enterpreneurial Activities 
##                       1790                       1771 
##              Wage/Salaries 
##                       3439
table(var_train_cat$Household.Head.Sex)
## 
## Female   Male 
##   1513   5487
table(var_train_cat$Household.Head.Marital.Status)
## 
##             Single            Widowed           Annulled Divorced/Separated 
##                319               1143                  4                223 
##            Married 
##               5311
table(var_train_cat$Household.Head.Job.or.Business.Indicator)
## 
##   No Job/Business With Job/Business 
##              1241              5759
table(var_train_cat$Household.Head.Class.of.Worker)
## 
## Worked without pay in own family-operated farm or business 
##                                                         50 
##           Employer in own family-operated farm or business 
##                                                        434 
##    Worked with pay in own family-operated farm or business 
##                                                          4 
##                          Self-employed wihout any employee 
##                                                       2304 
##                               Worked for private household 
##                                                        139 
##                           Worked for private establishment 
##                                                       2339 
##               Worked for government/government corporation 
##                                                        489
table(var_train_cat$Type.of.Household)
## 
##                          Single Family Two or More Nonrelated Persons/Members 
##                                   4790                                     24 
##                        Extended Family 
##                                   2186
table(var_train_cat$Type.of.Building.House)
## 
##       Other building unit (e.g. cave, boat) 
##                                           0 
##                Institutional living quarter 
##                                           2 
## Commercial/industrial/agricultural building 
##                                           7 
##                                Single house 
##                                        6584 
##                                      Duplex 
##                                         173 
##                      Multi-unit residential 
##                                         234
table(var_train_cat$Type.of.Roof)
## 
##                                           Salvaged/makeshift materials 
##                                                                     32 
##                                     Light material (cogon,nipa,anahaw) 
##                                                                    845 
##                             Mixed but predominantly salvaged materials 
##                                                                     11 
##                                Mixed but predominantly light materials 
##                                                                    133 
##                               Mixed but predominantly strong materials 
##                                                                    345 
## Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos) 
##                                                                   5633
table(var_train_cat$Type.of.Walls)
## 
##     Salvaged   Very Light        Light       Strong Quite Strong 
##           81          258         1366         4701          590
table(var_train_cat$Tenure.Status)
## 
##                               Own house, rent lot 
##                                                81 
##    Own house, rent-free lot with consent of owner 
##                                              1005 
## Own house, rent-free lot without consent of owner 
##                                               151 
##     Own or owner-like possession of house and lot 
##                                              5011 
##                     Rent house/room including lot 
##                                               389 
##     Rent-free house and lot with consent of owner 
##                                               336 
##  Rent-free house and lot without consent of owner 
##                                                20
table(var_train_cat$Toilet.Facilities)
## 
##                                                           None 
##                                                            255 
##                                                         Others 
##                                                             65 
##                                                       Open pit 
##                                                            206 
##                                                     Closed pit 
##                                                            361 
##    Water-sealed, other depository, shared with other household 
##                                                            133 
##  Water-sealed, other depository, used exclusively by household 
##                                                            428 
##   Water-sealed, sewer septic tank, shared with other household 
##                                                            646 
## Water-sealed, sewer septic tank, used exclusively by household 
##                                                           4906
table(var_train_cat$Electricity)
## 
##   No   Si 
##  753 6247
table(var_train_cat$Main.Source.of.Water.Supply)
## 
##                                  Others                                Dug well 
##                                      16                                     663 
##            Lake, river, rain and others  Unprotected spring, river, stream, etc 
##                                      81                                     105 
##    Protected spring, river, stream, etc                Tubed/piped shallow well 
##                                     459                                     251 
##           Shared, tubed/piped deep well          Own use, tubed/piped deep well 
##                                    1012                                     775 
##                                 Peddler  Shared, faucet, community water system 
##                                     147                                     769 
## Own use, faucet, community water system 
##                                    2722
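# Number.of.Motorcycle.Tricycle is numeric, so it is not part of var_train_cat and the resulting table is empty
table(var_train_cat$Number.of.Motorcycle.Tricycle)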
## < table of extent 0 >
table(var_train_cat$Household.Head.Highest.Grade.Completed)
## 
##                                                                                                                                                                         Agriculture, Forestry, and Fishery Programs 
##                                                                                                                                                                                                                  40 
##                                                                                                                                                                                  Architecture and Building Programs 
##                                                                                                                                                                                                                   6 
##                                                                                                                                                                                                       Arts Programs 
##                                                                                                                                                                                                                   4 
##                                                                                                                                                                                                      Basic Programs 
##                                                                                                                                                                                                                   6 
##                                                                                                                                                                                Business and Administration Programs 
##                                                                                                                                                                                                                 212 
##                                                                                                                                                                           Computing/Information Technology Programs 
##                                                                                                                                                                                                                  51 
##                                                                                                                                                                                                 Elementary Graduate 
##                                                                                                                                                                                                                1261 
##                                                                                                                                                                         Engineering and Engineering trades Programs 
##                                                                                                                                                                                                                  81 
##                                                                                                                                                                         Engineering and Engineering Trades Programs 
##                                                                                                                                                                                                                 147 
##                                                                                                                                                                                   Environmental Protection Programs 
##                                                                                                                                                                                                                   2 
##                                                                                                                                                                                                  First Year College 
##                                                                                                                                                                                                                 145 
##                                                                                                                                                                                              First Year High School 
##                                                                                                                                                                                                                 209 
##                                                                                                                                                                                           First Year Post Secondary 
##                                                                                                                                                                                                                  20 
##                                                                                                                                                                                                 Fourth Year College 
##                                                                                                                                                                                                                  17 
##                                                                                                                                                                                                             Grade 1 
##                                                                                                                                                                                                                 152 
##                                                                                                                                                                                                             Grade 2 
##                                                                                                                                                                                                                 263 
##                                                                                                                                                                                                             Grade 3 
##                                                                                                                                                                                                                 309 
##                                                                                                                                                                                                             Grade 4 
##                                                                                                                                                                                                                 366 
##                                                                                                                                                                                                             Grade 5 
##                                                                                                                                                                                                                 348 
##                                                                                                                                                                                                             Grade 6 
##                                                                                                                                                                                                                  49 
##                                                                                                                                                                                                     Health Programs 
##                                                                                                                                                                                                                  66 
##                                                                                                                                                                                                High School Graduate 
##                                                                                                                                                                                                                1665 
##                                                                                                                                                                                                 Humanities Programs 
##                                                                                                                                                                                                                   9 
##                                                                                                                                                                                 Journalism and Information Programs 
##                                                                                                                                                                                                                   7 
##                                                                                                                                                                                                        Law Programs 
##                                                                                                                                                                                                                   5 
##                                                                                                                                                                                              Life Sciences Programs 
##                                                                                                                                                                                                                   4 
##                                                                                                                                                                               Manufacturing and Processing Programs 
##                                                                                                                                                                                                                   3 
##                                                                                                                                                                                 Mathematics and Statistics Programs 
##                                                                                                                                                                                                                   2 
##                                                                                                                                                                                                  No Grade Completed 
##                                                                                                                                                                                                                 195 
##                                                        Other Programs in Education at the Third Level, First Stage, of the Type that Leads to an Award not Equivalent to a First University or Baccalaureate Degree 
##                                                                                                                                                                                                                  13 
## Other Programs of Education at the Third Level, First Stage, of the Type that Leads to a Baccalaureate or First University/Professional Degree (HIgher Education Level, First Stage, or Collegiate Education Level) 
##                                                                                                                                                                                                                   0 
##                                                                                                                                                                                          Personal Services Programs 
##                                                                                                                                                                                                                  22 
##                                                                                                                                                                                          Physical Sciences Programs 
##                                                                                                                                                                                                                   3 
##                                                                                                                                                                                                  Post Baccalaureate 
##                                                                                                                                                                                                                  39 
##                                                                                                                                                                                                           Preschool 
##                                                                                                                                                                                                                   3 
##                                                                                                                                                                                                 Second Year College 
##                                                                                                                                                                                                                 216 
##                                                                                                                                                                                             Second Year High School 
##                                                                                                                                                                                                                 363 
##                                                                                                                                                                                          Second Year Post Secondary 
##                                                                                                                                                                                                                  22 
##                                                                                                                                                                                          Security Services Programs 
##                                                                                                                                                                                                                  52 
##                                                                                                                                                                              Social and Behavioral Science Programs 
##                                                                                                                                                                                                                  23 
##                                                                                                                                                                                            Social Services Programs 
##                                                                                                                                                                                                                   0 
##                                                                                                                                                                    Teacher Training and Education Sciences Programs 
##                                                                                                                                                                                                                 159 
##                                                                                                                                                                                                  Third Year College 
##                                                                                                                                                                                                                 167 
##                                                                                                                                                                                              Third Year High School 
##                                                                                                                                                                                                                 234 
##                                                                                                                                                                                         Transport Services Programs 
##                                                                                                                                                                                                                  39 
##                                                                                                                                                                                                 Veterinary Programs 
##                                                                                                                                                                                                                   1

Next, the same process is repeated to obtain the relative frequencies, this time using the prop.table() function.
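As a minimal sketch of what prop.table() computes, the toy example below shows that the relative frequencies of a one-way table are simply the counts divided by their total. The data frame `toy` and its columns are hypothetical stand-ins for `var_train_cat`, not survey data. It also illustrates one possible way to obtain the relative frequencies of every categorical column in a single call with lapply(), instead of writing one call per variable.

```r
# Hypothetical toy data standing in for var_train_cat (not the survey data)
toy <- data.frame(
  sex = factor(c("Male", "Female", "Male", "Male")),
  electricity = factor(c("Si", "Si", "No", "Si"))
)

# Absolute frequencies, then relative frequencies
counts <- table(toy$sex)
rel <- prop.table(counts)

# For a one-way table, prop.table() divides each count by the total
manual <- counts / sum(counts)
same_result <- isTRUE(all.equal(as.numeric(rel), as.numeric(manual)))
print(same_result)

# Relative frequencies for every column at once
rel_all <- lapply(toy, function(col) prop.table(table(col)))
print(rel_all)
```

Applied to the real data, the same pattern would be `lapply(var_train_cat, function(col) prop.table(table(col)))`, assuming every column of `var_train_cat` is categorical.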

# Relative frequencies - prop.table function

prop.table(table(var_train_cat$Region))
## 
##                      ARMM                       CAR                    Caraga 
##                0.05357143                0.04271429                0.04271429 
##         I - Ilocos Region       II - Cagayan Valley       III - Central Luzon 
##                0.05728571                0.05242857                0.07600000 
##          IVA - CALABARZON            IVB - MIMAROPA IX - Zasmboanga Peninsula 
##                0.09400000                0.02757143                0.04557143 
##                       NCR          V - Bicol Region      VI - Western Visayas 
##                0.10342857                0.05885714                0.07100000 
##     VII - Central Visayas    VIII - Eastern Visayas     X - Northern Mindanao 
##                0.05942857                0.05671429                0.04228571 
##         XI - Davao Region        XII - SOCCSKSARGEN 
##                0.06171429                0.05471429
prop.table(table(var_train_cat$Main.Source.of.Income))
## 
##    Other sources of Income Enterpreneurial Activities 
##                  0.2557143                  0.2530000 
##              Wage/Salaries 
##                  0.4912857
prop.table(table(var_train_cat$Household.Head.Sex))
## 
##    Female      Male 
## 0.2161429 0.7838571
prop.table(table(var_train_cat$Household.Head.Marital.Status))
## 
##             Single            Widowed           Annulled Divorced/Separated 
##       0.0455714286       0.1632857143       0.0005714286       0.0318571429 
##            Married 
##       0.7587142857
prop.table(table(var_train_cat$Household.Head.Job.or.Business.Indicator))
## 
##   No Job/Business With Job/Business 
##         0.1772857         0.8227143
prop.table(table(var_train_cat$Household.Head.Class.of.Worker))
## 
## Worked without pay in own family-operated farm or business 
##                                                0.008682063 
##           Employer in own family-operated farm or business 
##                                                0.075360306 
##    Worked with pay in own family-operated farm or business 
##                                                0.000694565 
##                          Self-employed wihout any employee 
##                                                0.400069457 
##                               Worked for private household 
##                                                0.024136135 
##                           Worked for private establishment 
##                                                0.406146901 
##               Worked for government/government corporation 
##                                                0.084910575
prop.table(table(var_train_cat$Type.of.Household))
## 
##                          Single Family Two or More Nonrelated Persons/Members 
##                            0.684285714                            0.003428571 
##                        Extended Family 
##                            0.312285714
prop.table(table(var_train_cat$Type.of.Building.House))
## 
##       Other building unit (e.g. cave, boat) 
##                                0.0000000000 
##                Institutional living quarter 
##                                0.0002857143 
## Commercial/industrial/agricultural building 
##                                0.0010000000 
##                                Single house 
##                                0.9405714286 
##                                      Duplex 
##                                0.0247142857 
##                      Multi-unit residential 
##                                0.0334285714
prop.table(table(var_train_cat$Type.of.Roof))
## 
##                                           Salvaged/makeshift materials 
##                                                            0.004572082 
##                                     Light material (cogon,nipa,anahaw) 
##                                                            0.120731533 
##                             Mixed but predominantly salvaged materials 
##                                                            0.001571653 
##                                Mixed but predominantly light materials 
##                                                            0.019002715 
##                               Mixed but predominantly strong materials 
##                                                            0.049292756 
## Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos) 
##                                                            0.804829261
prop.table(table(var_train_cat$Type.of.Walls))
## 
##     Salvaged   Very Light        Light       Strong Quite Strong 
##   0.01157804   0.03687822   0.19525443   0.67195540   0.08433391
prop.table(table(var_train_cat$Tenure.Status))
## 
##                               Own house, rent lot 
##                                       0.011583012 
##    Own house, rent-free lot with consent of owner 
##                                       0.143715144 
## Own house, rent-free lot without consent of owner 
##                                       0.021593022 
##     Own or owner-like possession of house and lot 
##                                       0.716573717 
##                     Rent house/room including lot 
##                                       0.055627056 
##     Rent-free house and lot with consent of owner 
##                                       0.048048048 
##  Rent-free house and lot without consent of owner 
##                                       0.002860003
prop.table(table(var_train_cat$Toilet.Facilities))
## 
##                                                           None 
##                                                    0.036428571 
##                                                         Others 
##                                                    0.009285714 
##                                                       Open pit 
##                                                    0.029428571 
##                                                     Closed pit 
##                                                    0.051571429 
##    Water-sealed, other depository, shared with other household 
##                                                    0.019000000 
##  Water-sealed, other depository, used exclusively by household 
##                                                    0.061142857 
##   Water-sealed, sewer septic tank, shared with other household 
##                                                    0.092285714 
## Water-sealed, sewer septic tank, used exclusively by household 
##                                                    0.700857143
prop.table(table(var_train_cat$Electricity))
## 
##        No        Si 
## 0.1075714 0.8924286
prop.table(table(var_train_cat$Main.Source.of.Water.Supply))
## 
##                                  Others                                Dug well 
##                             0.002285714                             0.094714286 
##            Lake, river, rain and others  Unprotected spring, river, stream, etc 
##                             0.011571429                             0.015000000 
##    Protected spring, river, stream, etc                Tubed/piped shallow well 
##                             0.065571429                             0.035857143 
##           Shared, tubed/piped deep well          Own use, tubed/piped deep well 
##                             0.144571429                             0.110714286 
##                                 Peddler  Shared, faucet, community water system 
##                             0.021000000                             0.109857143 
## Own use, faucet, community water system 
##                             0.388857143
prop.table(table(var_train_cat$Household.Head.Highest.Grade.Completed))
## 
##                                                                                                                                                                         Agriculture, Forestry, and Fishery Programs 
##                                                                                                                                                                                                        0.0057142857 
##                                                                                                                                                                                  Architecture and Building Programs 
##                                                                                                                                                                                                        0.0008571429 
##                                                                                                                                                                                                       Arts Programs 
##                                                                                                                                                                                                        0.0005714286 
##                                                                                                                                                                                                      Basic Programs 
##                                                                                                                                                                                                        0.0008571429 
##                                                                                                                                                                                Business and Administration Programs 
##                                                                                                                                                                                                        0.0302857143 
##                                                                                                                                                                           Computing/Information Technology Programs 
##                                                                                                                                                                                                        0.0072857143 
##                                                                                                                                                                                                 Elementary Graduate 
##                                                                                                                                                                                                        0.1801428571 
##                                                                                                                                                                         Engineering and Engineering trades Programs 
##                                                                                                                                                                                                        0.0115714286 
##                                                                                                                                                                         Engineering and Engineering Trades Programs 
##                                                                                                                                                                                                        0.0210000000 
##                                                                                                                                                                                   Environmental Protection Programs 
##                                                                                                                                                                                                        0.0002857143 
##                                                                                                                                                                                                  First Year College 
##                                                                                                                                                                                                        0.0207142857 
##                                                                                                                                                                                              First Year High School 
##                                                                                                                                                                                                        0.0298571429 
##                                                                                                                                                                                           First Year Post Secondary 
##                                                                                                                                                                                                        0.0028571429 
##                                                                                                                                                                                                 Fourth Year College 
##                                                                                                                                                                                                        0.0024285714 
##                                                                                                                                                                                                             Grade 1 
##                                                                                                                                                                                                        0.0217142857 
##                                                                                                                                                                                                             Grade 2 
##                                                                                                                                                                                                        0.0375714286 
##                                                                                                                                                                                                             Grade 3 
##                                                                                                                                                                                                        0.0441428571 
##                                                                                                                                                                                                             Grade 4 
##                                                                                                                                                                                                        0.0522857143 
##                                                                                                                                                                                                             Grade 5 
##                                                                                                                                                                                                        0.0497142857 
##                                                                                                                                                                                                             Grade 6 
##                                                                                                                                                                                                        0.0070000000 
##                                                                                                                                                                                                     Health Programs 
##                                                                                                                                                                                                        0.0094285714 
##                                                                                                                                                                                                High School Graduate 
##                                                                                                                                                                                                        0.2378571429 
##                                                                                                                                                                                                 Humanities Programs 
##                                                                                                                                                                                                        0.0012857143 
##                                                                                                                                                                                 Journalism and Information Programs 
##                                                                                                                                                                                                        0.0010000000 
##                                                                                                                                                                                                        Law Programs 
##                                                                                                                                                                                                        0.0007142857 
##                                                                                                                                                                                              Life Sciences Programs 
##                                                                                                                                                                                                        0.0005714286 
##                                                                                                                                                                               Manufacturing and Processing Programs 
##                                                                                                                                                                                                        0.0004285714 
##                                                                                                                                                                                 Mathematics and Statistics Programs 
##                                                                                                                                                                                                        0.0002857143 
##                                                                                                                                                                                                  No Grade Completed 
##                                                                                                                                                                                                        0.0278571429 
##                                                        Other Programs in Education at the Third Level, First Stage, of the Type that Leads to an Award not Equivalent to a First University or Baccalaureate Degree 
##                                                                                                                                                                                                        0.0018571429 
## Other Programs of Education at the Third Level, First Stage, of the Type that Leads to a Baccalaureate or First University/Professional Degree (HIgher Education Level, First Stage, or Collegiate Education Level) 
##                                                                                                                                                                                                        0.0000000000 
##                                                                                                                                                                                          Personal Services Programs 
##                                                                                                                                                                                                        0.0031428571 
##                                                                                                                                                                                          Physical Sciences Programs 
##                                                                                                                                                                                                        0.0004285714 
##                                                                                                                                                                                                  Post Baccalaureate 
##                                                                                                                                                                                                        0.0055714286 
##                                                                                                                                                                                                           Preschool 
##                                                                                                                                                                                                        0.0004285714 
##                                                                                                                                                                                                 Second Year College 
##                                                                                                                                                                                                        0.0308571429 
##                                                                                                                                                                                             Second Year High School 
##                                                                                                                                                                                                        0.0518571429 
##                                                                                                                                                                                          Second Year Post Secondary 
##                                                                                                                                                                                                        0.0031428571 
##                                                                                                                                                                                          Security Services Programs 
##                                                                                                                                                                                                        0.0074285714 
##                                                                                                                                                                              Social and Behavioral Science Programs 
##                                                                                                                                                                                                        0.0032857143 
##                                                                                                                                                                                            Social Services Programs 
##                                                                                                                                                                                                        0.0000000000 
##                                                                                                                                                                    Teacher Training and Education Sciences Programs 
##                                                                                                                                                                                                        0.0227142857 
##                                                                                                                                                                                                  Third Year College 
##                                                                                                                                                                                                        0.0238571429 
##                                                                                                                                                                                              Third Year High School 
##                                                                                                                                                                                                        0.0334285714 
##                                                                                                                                                                                         Transport Services Programs 
##                                                                                                                                                                                                        0.0055714286 
##                                                                                                                                                                                                 Veterinary Programs 
##                                                                                                                                                                                                        0.0001428571

Among the visualizations, the categorical variable Electricity stands out. We compare Electricity across regions, since it can hint at which regions may have higher poverty levels. This is done with the CrossTable function, which reports four values for each cell: the absolute frequency, the frequency relative to the row total, the frequency relative to the column total, and the frequency relative to the table total.

CrossTable(var_train_cat$Region, var_train_cat$Electricity, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  7000 
## 
##  
##                           | var_train_cat$Electricity 
##      var_train_cat$Region |        No |        Si | Row Total | 
## --------------------------|-----------|-----------|-----------|
##                      ARMM |       162 |       213 |       375 | 
##                           |     0.432 |     0.568 |     0.054 | 
##                           |     0.215 |     0.034 |           | 
##                           |     0.023 |     0.030 |           | 
## --------------------------|-----------|-----------|-----------|
##                       CAR |        18 |       281 |       299 | 
##                           |     0.060 |     0.940 |     0.043 | 
##                           |     0.024 |     0.045 |           | 
##                           |     0.003 |     0.040 |           | 
## --------------------------|-----------|-----------|-----------|
##                    Caraga |        24 |       275 |       299 | 
##                           |     0.080 |     0.920 |     0.043 | 
##                           |     0.032 |     0.044 |           | 
##                           |     0.003 |     0.039 |           | 
## --------------------------|-----------|-----------|-----------|
##         I - Ilocos Region |        20 |       381 |       401 | 
##                           |     0.050 |     0.950 |     0.057 | 
##                           |     0.027 |     0.061 |           | 
##                           |     0.003 |     0.054 |           | 
## --------------------------|-----------|-----------|-----------|
##       II - Cagayan Valley |        23 |       344 |       367 | 
##                           |     0.063 |     0.937 |     0.052 | 
##                           |     0.031 |     0.055 |           | 
##                           |     0.003 |     0.049 |           | 
## --------------------------|-----------|-----------|-----------|
##       III - Central Luzon |        13 |       519 |       532 | 
##                           |     0.024 |     0.976 |     0.076 | 
##                           |     0.017 |     0.083 |           | 
##                           |     0.002 |     0.074 |           | 
## --------------------------|-----------|-----------|-----------|
##          IVA - CALABARZON |        26 |       632 |       658 | 
##                           |     0.040 |     0.960 |     0.094 | 
##                           |     0.035 |     0.101 |           | 
##                           |     0.004 |     0.090 |           | 
## --------------------------|-----------|-----------|-----------|
##            IVB - MIMAROPA |        26 |       167 |       193 | 
##                           |     0.135 |     0.865 |     0.028 | 
##                           |     0.035 |     0.027 |           | 
##                           |     0.004 |     0.024 |           | 
## --------------------------|-----------|-----------|-----------|
## IX - Zasmboanga Peninsula |        53 |       266 |       319 | 
##                           |     0.166 |     0.834 |     0.046 | 
##                           |     0.070 |     0.043 |           | 
##                           |     0.008 |     0.038 |           | 
## --------------------------|-----------|-----------|-----------|
##                       NCR |         6 |       718 |       724 | 
##                           |     0.008 |     0.992 |     0.103 | 
##                           |     0.008 |     0.115 |           | 
##                           |     0.001 |     0.103 |           | 
## --------------------------|-----------|-----------|-----------|
##          V - Bicol Region |        47 |       365 |       412 | 
##                           |     0.114 |     0.886 |     0.059 | 
##                           |     0.062 |     0.058 |           | 
##                           |     0.007 |     0.052 |           | 
## --------------------------|-----------|-----------|-----------|
##      VI - Western Visayas |        67 |       430 |       497 | 
##                           |     0.135 |     0.865 |     0.071 | 
##                           |     0.089 |     0.069 |           | 
##                           |     0.010 |     0.061 |           | 
## --------------------------|-----------|-----------|-----------|
##     VII - Central Visayas |        47 |       369 |       416 | 
##                           |     0.113 |     0.887 |     0.059 | 
##                           |     0.062 |     0.059 |           | 
##                           |     0.007 |     0.053 |           | 
## --------------------------|-----------|-----------|-----------|
##    VIII - Eastern Visayas |        64 |       333 |       397 | 
##                           |     0.161 |     0.839 |     0.057 | 
##                           |     0.085 |     0.053 |           | 
##                           |     0.009 |     0.048 |           | 
## --------------------------|-----------|-----------|-----------|
##     X - Northern Mindanao |        44 |       252 |       296 | 
##                           |     0.149 |     0.851 |     0.042 | 
##                           |     0.058 |     0.040 |           | 
##                           |     0.006 |     0.036 |           | 
## --------------------------|-----------|-----------|-----------|
##         XI - Davao Region |        48 |       384 |       432 | 
##                           |     0.111 |     0.889 |     0.062 | 
##                           |     0.064 |     0.061 |           | 
##                           |     0.007 |     0.055 |           | 
## --------------------------|-----------|-----------|-----------|
##        XII - SOCCSKSARGEN |        65 |       318 |       383 | 
##                           |     0.170 |     0.830 |     0.055 | 
##                           |     0.086 |     0.051 |           | 
##                           |     0.009 |     0.045 |           | 
## --------------------------|-----------|-----------|-----------|
##              Column Total |       753 |      6247 |      7000 | 
##                           |     0.108 |     0.892 |           | 
## --------------------------|-----------|-----------|-----------|
## 
## 
#CrossTable(var_train_cat$Household.Head.Class.of.Worker, var_train_cat$Number.of.Stove.with.Oven.Gas.Range, prop.chisq = FALSE)
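
The four values in each cell can be reproduced with base R. The following is a minimal sketch, not the authors' code: it copies the counts of only two rows (ARMM and NCR) from the table above, so the column and table proportions refer to this two-row subset, while the row proportions match the full table.

```r
# Counts copied from the ARMM and NCR rows of the cross table above
# (rows: Region, columns: Electricity).
counts <- matrix(
  c(162, 213,
      6, 718),
  nrow = 2, byrow = TRUE,
  dimnames = list(Region = c("ARMM", "NCR"), Electricity = c("No", "Si"))
)

row_prop   <- prop.table(counts, margin = 1)  # N / Row Total
col_prop   <- prop.table(counts, margin = 2)  # N / Col Total (within this subset only)
table_prop <- prop.table(counts)              # N / Table Total (within this subset only)

# Share of households without electricity per region, highest first
sort(row_prop[, "No"], decreasing = TRUE)
```

Sorting the row proportions of the "No" column is one simple way to rank regions by the share of households without electricity.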

A third variable of interest is added to the analysis: Household.Head.Sex, the sex of the person who makes the main decisions in the household.

# ----- Multidimensional frequency analysis -----

# Analysis of Electricity by region and sex of the household head

ftable(var_train_cat$Region, var_train_cat$Household.Head.Sex, var_train_cat$Electricity)
##                                    No  Si
##                                          
##  ARMM                     Female    9  18
##                           Male    153 195
## CAR                       Female    1  65
##                           Male     17 216
## Caraga                    Female    4  46
##                           Male     20 229
## I - Ilocos Region         Female    7  97
##                           Male     13 284
## II - Cagayan Valley       Female    1  50
##                           Male     22 294
## III - Central Luzon       Female    4 124
##                           Male      9 395
## IVA - CALABARZON          Female    9 156
##                           Male     17 476
## IVB - MIMAROPA            Female    1  36
##                           Male     25 131
## IX - Zasmboanga Peninsula Female    7  51
##                           Male     46 215
## NCR                       Female    1 194
##                           Male      5 524
## V - Bicol Region          Female    8  86
##                           Male     39 279
## VI - Western Visayas      Female   14 106
##                           Male     53 324
## VII - Central Visayas     Female   10 106
##                           Male     37 263
## VIII - Eastern Visayas    Female    9  77
##                           Male     55 256
## X - Northern Mindanao     Female    8  53
##                           Male     36 199
## XI - Davao Region         Female   11  73
##                           Male     37 311
## XII - SOCCSKSARGEN        Female   11  60
##                           Male     54 258
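
Because the groups differ greatly in size, the raw counts are easier to compare as proportions within each region and sex. The sketch below is an illustration under stated assumptions rather than part of the original analysis: it rebuilds a small three-way array from the ARMM and NCR rows of the output above and conditions on the first two dimensions.

```r
# Counts copied from the ARMM and NCR rows of the ftable output above.
# array() fills the first index fastest: Region, then Sex, then Electricity.
counts <- array(
  c(9, 1, 153, 5,       # Electricity = "No"
    18, 194, 195, 524), # Electricity = "Si"
  dim = c(2, 2, 2),
  dimnames = list(
    Region = c("ARMM", "NCR"),
    Sex = c("Female", "Male"),
    Electricity = c("No", "Si")
  )
)

# Proportions within each Region x Sex combination
cond_prop <- prop.table(counts, margin = c(1, 2))

# Share of households without electricity, by region and sex
share_no <- cond_prop[, , "No"]
round(share_no, 3)
```

With the full data, the same idea would be `prop.table(table(region, sex, electricity), margin = c(1, 2))`, assuming the three factors are passed in that order.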

Finally, the data are visualized with a series of bar charts:

# ----- EDA plots for individual qualitative variables -----

ggplot(datos, aes(Region)) + geom_bar() + ggtitle("Number of families by region") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(datos, aes(Main.Source.of.Income)) + geom_bar() + ggtitle("Number of families by source of income")

# ----- Visualization of qualitative data -----

barplot(table(datos$Region), col = c("lightblue","yellow", "cadetblue4"),
        main = "Bar chart of the absolute frequencies\n of the variable \"Region\"")

# With beside = TRUE, each bar group is a table column (Electricity)
# and each legend entry is a table row (Household.Head.Sex).
barplot(table(datos$Household.Head.Sex, datos$Electricity),
        beside = TRUE, 
        col = c("yellow", "lightblue"),
        names.arg = c("No", "Yes"), 
        legend.text = c("Women", "Men"))

barplot(prop.table(table(datos$Household.Head.Class.of.Worker,datos$Main.Source.of.Income)),
        beside = TRUE, col = c("chocolate","cornsilk1","cornflowerblue","blueviolet", "darkgoldenrod1", "coral", "brown", "chartreuse4"),
        legend.text = TRUE, main = "Relative frequencies of income source\n by class of worker",
        ylim = c(0,1))

4.3. Analysis of quantitative variables

We now examine the distribution and density of each quantitative variable without any transformation, that is, the raw variables. The aim is to identify the variables with the most skewed data and to observe their distributions and ranges.

# Histograms of the untransformed quantitative variables
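
As a numeric complement to visual inspection, skewness can also be ranked directly. The following is a minimal sketch, not part of the original analysis: `skewness` is a hypothetical helper implementing the moment-based sample skewness, and `toy` is made-up data standing in for the numeric training set.

```r
# Hypothetical helper: moment-based sample skewness.
# Positive values indicate a long right tail, values near 0 a symmetric shape.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up columns for illustration only
toy <- data.frame(
  symmetric    = c(1, 2, 3, 4, 5),
  right_skewed = c(1, 1, 1, 2, 20)
)

# Skewness per column, most skewed first
skew <- sapply(toy, skewness)
sort(skew, decreasing = TRUE)
```

A mean well above the median in the output of `summary()` points in the same direction as a large positive skewness.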

summary(var_train_num$Total.Household.Income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17840  106132  165773  248096  294968 4942530
qplot(var_train_num$Total.Household.Income,
      geom="histogram",
      binwidth = 10000,
      main = "Histogram for Total Household Income", 
      xlab = "Total Household Income",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(10000,12000000))
## Warning: Removed 2 rows containing missing values (geom_bar).
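
The summary-plus-histogram pattern above is repeated for every numeric column, so it could be wrapped in a small function. The sketch below is one possible way to do that, under assumptions: `plot_histogram` is a hypothetical helper, `toy` is simulated data rather than the survey, and `ggplot()` replaces `qplot()`. Limits set with `xlim` make ggplot2 drop whatever falls outside the scale, which is the usual source of warnings like the one above, whereas `coord_cartesian()` only zooms the view.

```r
library(ggplot2)

# Hypothetical helper: histogram of one numeric column, styled like the plots above.
plot_histogram <- function(data, variable, binwidth) {
  ggplot(data, aes(x = .data[[variable]])) +
    geom_histogram(binwidth = binwidth, fill = "blue", colour = "red") +
    labs(title = paste("Histogram for", variable), x = variable)
}

# Simulated right-skewed data standing in for var_train_num (values are made up)
set.seed(123)
toy <- data.frame(Income = rlnorm(200, meanlog = 12, sdlog = 0.8))

p <- plot_histogram(toy, "Income", binwidth = 10000)

# Zoom in on the bulk of the distribution without discarding any observation
p_zoom <- p + coord_cartesian(xlim = c(0, 500000))
p_zoom
```

Applied to the real data, the helper could be called once per column name, for example inside `lapply()` with a bin width chosen for each variable.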

summary(var_train_num$Total.Food.Expenditure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6275   51422   73354   85554  106458  720007
qplot(var_train_num$Total.Food.Expenditure,
      geom="histogram",
      binwidth = 10000,
      main = "Histogram for Total Food Expenditure", 
      xlab = "Total Food Expenditure",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(2000,800000))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Bread.and.Cereals.Expenditure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   16665   23196   24978   31200  345643
qplot(var_train_num$Bread.and.Cereals.Expenditure,
      geom="histogram",
      binwidth = 1000,
      main = "Histogram for Bread.and.Cereals.Expenditure", 
      xlab = "Bread.and.Cereals.Expenditure",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1000,350000))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Total.Rice.Expenditure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   10909   16473   18014   23903  343907
qplot(var_train_num$Total.Rice.Expenditure,
      geom="histogram",
      binwidth = 1000,
      main = "Histogram for Total Rice Expenditure", 
      xlab = "Total Rice Expenditure",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1000,350000))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Meat.Expenditure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3336    7460   10626   14253  261566
qplot(var_train_num$Meat.Expenditure,
      geom="histogram",
      binwidth = 1000,
      main = "Histogram for Meat.Expenditure", 
      xlab = "Meat.Expenditure",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1000,270000))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Total.Fish.and..marine.products.Expenditure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    5492    8649   10489   13212   81675
qplot(var_train_num$Total.Fish.and..marine.products.Expenditure,
      geom="histogram",
      binwidth = 1000,
      main = "Histogram for Total.Fish.and..marine.products.Expenditure", 
      xlab = "Total.Fish.and..marine.products.Expenditure",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1000,190000))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Fruit.Expenditure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    1012    1830    2544    3114   82600
qplot(var_train_num$Fruit.Expenditure,
      geom="histogram",
      binwidth = 1000,
      main = "Histogram for Fruit.Expenditure", 
      xlab = "Fruit.Expenditure",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1000,70000))
## Warning: Removed 1 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Vegetables.Expenditure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2876    4393    5066    6400   49810
qplot(var_train_num$Vegetables.Expenditure,
      geom="histogram",
      binwidth = 1000,
      main = "Histogram for Vegetables.Expenditure", 
      xlab = "Vegetables.Expenditure",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1000,80000))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Restaurant.and.hotels.Expenditure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2020    7480   16031   20892  421950
qplot(var_train_num$Restaurant.and.hotels.Expenditure,
      geom="histogram",
      binwidth = 5000,
      main = "Histogram for Restaurant.and.hotels.Expenditure", 
      xlab = "Restaurant.and.hotels.Expenditure",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-5000,520000))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Alcoholic.Beverages.Expenditure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0     276    1095    1300   38220
qplot(var_train_num$Alcoholic.Beverages.Expenditure,
      geom="histogram",
      binwidth = 1000,
      main = "Histogram for Alcoholic.Beverages.Expenditure", 
      xlab = "Alcoholic.Beverages.Expenditure",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1000,36000))
## Warning: Removed 1 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Tobacco.Expenditure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0     195    2275    3120   97740
qplot(var_train_num$Tobacco.Expenditure,
      geom="histogram",
      binwidth = 1000,
      main = "Histogram for Tobacco.Expenditure", 
      xlab = "Tobacco.Expenditure",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1000,100000))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Clothing..Footwear.and.Other.Wear.Expenditure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    1372    2776    4975    5771  112830
qplot(var_train_num$Clothing..Footwear.and.Other.Wear.Expenditure,
      geom="histogram",
      binwidth = 10000,
      main = "Histogram for Clothing..Footwear.and.Other.Wear.Expenditure", 
      xlab = "Clothing..Footwear.and.Other.Wear.Expenditure",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-10000,360000))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Housing.and.water.Expenditure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2310   13140   23163   38826   47157 1308180
qplot(var_train_num$Housing.and.water.Expenditure,
      geom="histogram",
      binwidth = 10000,
      main = "Histogram for Housing.and.water.Expenditure", 
      xlab = "Housing.and.water.Expenditure",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(2000,842000))
## Warning: Removed 5 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Imputed.House.Rental.Value)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    6000   10800   21088   24000 1200000
qplot(var_train_num$Imputed.House.Rental.Value,
      geom="histogram",
      binwidth = 1000,
      main = "Histogram for Imputed.House.Rental.Value", 
      xlab = "Imputed.House.Rental.Value",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1000,730000))
## Warning: Removed 2 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Medical.Care.Expenditure)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.0    282.8   1099.0   7317.3   4597.2 672466.0
qplot(var_train_num$Medical.Care.Expenditure,
      geom="histogram",
      binwidth = 10000,
      main = "Histogram for Medical.Care.Expenditure", 
      xlab = "Medical.Care.Expenditure",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-10000,1000000))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Transportation.Expenditure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2434    6240   11926   13934  240000
qplot(var_train_num$Transportation.Expenditure,
      geom="histogram",
      binwidth = 10000,
      main = "Histogram for Transportation.Expenditure", 
      xlab = "Transportation.Expenditure",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-10000,500000))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Communication.Expenditure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     576    1560    4135    3998   87600
qplot(var_train_num$Communication.Expenditure,
      geom="histogram",
      binwidth = 1000,
      main = "Histogram for Communication.Expenditure", 
      xlab = "Communication.Expenditure",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1000,100000))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Education.Expenditure)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.0      0.0    884.5   7190.9   4120.0 396000.0
qplot(var_train_num$Education.Expenditure,
      geom="histogram",
      binwidth = 10000,
      main = "Histogram for Education.Expenditure", 
      xlab = "Education.Expenditure",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-10000,340000))
## Warning: Removed 1 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Miscellaneous.Goods.and.Services.Expenditure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      18    3862    6867   12601   14151  292086
qplot(var_train_num$Miscellaneous.Goods.and.Services.Expenditure,
      geom="histogram",
      binwidth = 10000,
      main = "Histogram for Miscellaneous.Goods.and.Services.Expenditure", 
      xlab = "Miscellaneous.Goods.and.Services.Expenditure",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-10000,320000))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Special.Occasions.Expenditure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0    1500    5353    5000  340000
qplot(var_train_num$Special.Occasions.Expenditure,
      geom="histogram",
      binwidth = 10000,
      main = "Histogram for Special.Occasions.Expenditure", 
      xlab = "Special.Occasions.Expenditure",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-10000,310000))
## Warning: Removed 1 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Crop.Farming.and.Gardening.expenses)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0   13928    6592 1779690
qplot(var_train_num$Crop.Farming.and.Gardening.expenses,
      geom="histogram",
      binwidth = 100000,
      main = "Histogram for Crop.Farming.and.Gardening.expenses", 
      xlab = "Crop.Farming.and.Gardening.expenses",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-100000,3800000))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Total.Income.from.Entrepreneurial.Acitivites)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0   18990   54199   65950 4798140
qplot(var_train_num$Total.Income.from.Entrepreneurial.Acitivites,
      geom="histogram",
      binwidth = 100000,
      main = "Histogram for Total.Income.from.Entrepreneurial.Acitivites", 
      xlab = "Total.Income.from.Entrepreneurial.Acitivites",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-100000,4800000))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Household.Head.Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   41.00   50.00   51.26   61.00   98.00
qplot(var_train_num$Household.Head.Age,
      geom="histogram",
      binwidth = 5,
      main = "Histogram for Household.Head.Age", 
      xlab = "Household.Head.Age",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(10,100))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Total.Number.of.Family.members)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   4.000   4.649   6.000  20.000
qplot(var_train_num$Total.Number.of.Family.members,
      geom="histogram",
      binwidth = 1,
      main = "Histogram for Total.Number.of.Family.members", 
      xlab = "Total.Number.of.Family.members",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1,23))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Total.number.of.family.members.employed)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.281   2.000   8.000
qplot(var_train_num$Total.number.of.family.members.employed,
      geom="histogram",
      binwidth = 1,
      main = "Histogram for Total.number.of.family.members.employed", 
      xlab = "Total.number.of.family.members.employed",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1,10))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$House.Floor.Area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   25.00   40.00   55.64   70.00  998.00
qplot(var_train_num$House.Floor.Area,
      geom="histogram",
      binwidth = 25,
      main = "Histogram for House.Floor.Area", 
      xlab = "House.Floor.Area",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-25,1000))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$House.Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   17.00   20.13   27.00  150.00
qplot(var_train_num$House.Age,
      geom="histogram",
      binwidth = 5,
      main = "Histogram for House.Age", 
      xlab = "House.Age",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1,130))
## Warning: Removed 1 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Number.of.bedrooms)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   1.792   2.000   9.000
qplot(var_train_num$Number.of.bedrooms,
      geom="histogram",
      binwidth = 1,
      main = "Histogram for Number.of.bedrooms", 
      xlab = "Number.of.bedrooms",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1,10))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Number.of.Television)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  1.0000  1.0000  0.8621  1.0000  6.0000
qplot(var_train_num$Number.of.Television,
      geom="histogram",
      binwidth = 1,
      main = "Histogram for Number.of.Television", 
      xlab = "Number.of.Television",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1,7))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Number.of.CD.VCD.DVD)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4489  1.0000  4.0000
qplot(var_train_num$Number.of.CD.VCD.DVD,
      geom="histogram",
      binwidth = 1,
      main = "Histogram for Number.of.CD.VCD.DVD", 
      xlab = "Number.of.CD.VCD.DVD",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1,7))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Number.of.Component.Stereo.set)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1636  0.0000  5.0000
qplot(var_train_num$Number.of.Component.Stereo.set,
      geom="histogram",
      binwidth = 1,
      main = "Histogram for Number.of.Component.Stereo.set", 
      xlab = "Number.of.Component.Stereo.set",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1,7))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Number.of.Refrigerator.Freezer)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4067  1.0000  5.0000
qplot(var_train_num$Number.of.Refrigerator.Freezer,
      geom="histogram",
      binwidth = 1,
      main = "Histogram for Number.of.Refrigerator.Freezer", 
      xlab = "Number.of.Refrigerator.Freezer",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1,7))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Number.of.Washing.Machine)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3261  1.0000  3.0000
qplot(var_train_num$Number.of.Washing.Machine,
      geom="histogram",
      binwidth = 1,
      main = "Histogram for Number.of.Washing.Machine", 
      xlab = "Number.of.Washing.Machine",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1,5))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Number.of.Airconditioner)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1291  0.0000  5.0000
qplot(var_train_num$Number.of.Airconditioner,
      geom="histogram",
      binwidth = 1,
      main = "Histogram for Number.of.Airconditioner", 
      xlab = "Number.of.Airconditioner",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1,7))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Number.of.Car..Jeep..Van)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00    0.08    0.00    5.00
qplot(var_train_num$Number.of.Car..Jeep..Van,
      geom="histogram",
      binwidth = 1,
      main = "Histogram for Number.of.Car..Jeep..Van", 
      xlab = "Number.of.Car..Jeep..Van",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1,6))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Number.of.Landline.wireless.telephones)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06171 0.00000 4.00000
qplot(var_train_num$Number.of.Landline.wireless.telephones,
      geom="histogram",
      binwidth = 1,
      main = "Histogram for Number.of.Landline.wireless.telephones", 
      xlab = "Number.of.Landline.wireless.telephones",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1,7))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Number.of.Cellular.phone)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    2.00    1.95    3.00   10.00
qplot(var_train_num$Number.of.Cellular.phone,
      geom="histogram",
      binwidth = 1,
      main = "Histogram for Number.of.Cellular.phone", 
      xlab = "Number.of.Cellular.phone",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1,12))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Number.of.Personal.Computer)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3287  0.0000  6.0000
qplot(var_train_num$Number.of.Personal.Computer,
      geom="histogram",
      binwidth = 1,
      main = "Histogram for Number.of.Personal.Computer", 
      xlab = "Number.of.Personal.Computer",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1,8))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Number.of.Stove.with.Oven.Gas.Range)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1346  0.0000  3.0000
qplot(var_train_num$Number.of.Stove.with.Oven.Gas.Range,
      geom="histogram",
      binwidth = 1,
      main = "Histogram for Number.of.Stove.with.Oven.Gas.Range", 
      xlab = "Number.of.Stove.with.Oven.Gas.Range",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1,4))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Number.of.Motorized.Banca)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01114 0.00000 3.00000
qplot(var_train_num$Number.of.Motorized.Banca,
      geom="histogram",
      binwidth = 1,
      main = "Histogram for Number.of.Motorized.Banca", 
      xlab = "Number.of.Motorized.Banca",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1,5))
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(var_train_num$Number.of.Motorcycle.Tricycle)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3019  1.0000  5.0000
qplot(var_train_num$Number.of.Motorcycle.Tricycle,
      geom="histogram",
      binwidth = 1,
      main = "Histogram for Number.of.Motorcycle.Tricycle", 
      xlab = "Number.of.Motorcycle.Tricycle",  
      fill=I("blue"), 
      col=I("red"), 
      xlim=c(-1,7))
## Warning: Removed 2 rows containing missing values (geom_bar).
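
The blocks above repeat the same histogram call for every column. As a sketch only, that repetition could be folded into one helper. The names `plot_histogram`, `toy`, and `binwidths` are hypothetical, the toy values are made up, and the block calls `ggplot()` directly because `qplot()` is deprecated as of ggplot2 3.4.0.

```r
library(ggplot2)

# Hypothetical helper: one histogram per column, styled like the plots above.
plot_histogram <- function(df, column, binwidth) {
  ggplot(df, aes(x = .data[[column]])) +
    geom_histogram(binwidth = binwidth, fill = "blue", colour = "red") +
    labs(title = paste("Histogram for", column), x = column)
}

# Toy stand-in for var_train_num (illustrative values only)
toy <- data.frame(House.Age = c(0, 5, 10, 17, 27, 40),
                  Number.of.bedrooms = c(0, 1, 2, 2, 2, 3))

# One bin width per column, as in the individual calls above
binwidths <- c(House.Age = 5, Number.of.bedrooms = 1)

plots <- lapply(names(binwidths),
                function(col) plot_histogram(toy, col, binwidths[[col]]))
names(plots) <- names(binwidths)
```

Printing `plots$House.Age` would draw the corresponding histogram. Axis limits are left to ggplot2 here, which avoids the "removed rows" warnings caused by a manual `xlim`.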

Most of the variables are right-skewed, which is common in socioeconomic data. At first glance, very few look roughly symmetric. Judging by the summaries above, the clearest case is "Household.Head.Age" (median 50, mean 51.26), whereas "House.Age" (median 17, mean 20.13, maximum 150) still shows a long right tail.

3. Visualizing qualitative and quantitative data

As the final part of the exploratory data analysis, a few visualizations describe the population of Filipino families in 2017. Judging by these plots, the population appears to be predominantly agrarian, with field work among the most common occupations. In addition, families in the NAT region are the ones with the highest expenditure.

# Most common occupations in the Philippines

by_common_jobs <- datos_occupation %>% 
  group_by(Household.Head.Occupation) %>%
  summarise(Total = n()) %>%
  arrange(desc(Total)) %>%
  head(20) %>% ungroup()


ggplot(data = by_common_jobs) +
  geom_bar(mapping = aes(x = Household.Head.Occupation, y = Total), stat = "identity") +
  labs(title = "Most common jobs in Filipino families") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

# Region and expenditure

by_region_educ <- datos_occupation %>%
  group_by(Region, Education.Expenditure, Housing.and.water.Expenditure) %>% 
  summarise(Total = n()) %>%
  arrange(desc(Total)) %>% ungroup()


# The variable must be shown on a log scale for the boxplot to be readable
# (zero expenditures become -Inf on that scale and are dropped from the plot)

ggplot(by_region_educ, aes(x = Region, y = Education.Expenditure)) +
  geom_boxplot(color = "black", fill = "orange", alpha = 0.6) +
  scale_y_log10() +
  labs(title = "Education expenditure by region") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
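
A numeric companion to the boxplot is the median expenditure per region. The sketch below uses a made-up data frame `toy` (region labels and amounts are illustrative, not survey values) and base R's `aggregate`; it is not part of the original analysis.

```r
# Toy data: one row per household (illustrative values only)
toy <- data.frame(
  Region = c("North", "North", "North", "South", "South", "South"),
  Education.Expenditure = c(0, 1200, 5000, 300, 800, 900)
)

# Median education expenditure per region
median_by_region <- aggregate(Education.Expenditure ~ Region, data = toy, FUN = median)

# Sort regions from highest to lowest median
median_by_region <- median_by_region[order(-median_by_region$Education.Expenditure), ]
median_by_region
```

Unlike the log-scale boxplot, the median keeps households with zero expenditure in the calculation.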

5. Missing-data imputation and variable treatment

5.1 KNN method

With the EDA complete and a better understanding of the available data, it is time to prepare the data for modelling. The first step is to diagnose missing values, which will then be imputed with plausible values.

From this point on, only the train data set is used; the test data are not used until the last part of this work.

# ----- Detection and imputation of missing data -----

# Total number of NA values in the train data set

sum(is.na(datos_training))
## [1] 1253
# Number of rows in the train data set that contain at least one NA

sum(!complete.cases(datos_training))
## [1] 1250

There are quite a few NA values in the data set (1253 cells across 1250 rows), but all of them fall in the qualitative variables. The plot below shows how the NA values are distributed across the qualitative variables.

# Number of NA values in the quantitative and in the qualitative variables

sum(is.na(var_train_num))
## [1] 0
sum(is.na(var_train_cat))
## [1] 1253
sum(is.na(var_train_cat$Tenure.Status))
## [1] 7
# Plot of how the NA values are distributed across the qualitative variables

aggr_plot <- aggr(var_train_cat,
                  numbers = TRUE, sortVars = TRUE,
                  labels = names(var_train_cat),
                  cex.axis = .7, gap = 3,
                  ylab = c('Histogram of missing data', 'Missing-data patterns'),
                  only.miss = TRUE)
## Warning in plot.aggr(res, ...): not enough horizontal space to display
## frequencies

## 
##  Variables sorted by number of missings: 
##                                  Variable        Count
##            Household.Head.Class.of.Worker 0.1772857143
##                             Tenure.Status 0.0010000000
##                             Type.of.Walls 0.0005714286
##                              Type.of.Roof 0.0001428571
##                                    Region 0.0000000000
##                     Main.Source.of.Income 0.0000000000
##                        Household.Head.Sex 0.0000000000
##             Household.Head.Marital.Status 0.0000000000
##    Household.Head.Highest.Grade.Completed 0.0000000000
##  Household.Head.Job.or.Business.Indicator 0.0000000000
##                         Type.of.Household 0.0000000000
##                    Type.of.Building.House 0.0000000000
##                         Toilet.Facilities 0.0000000000
##                               Electricity 0.0000000000
##               Main.Source.of.Water.Supply 0.0000000000
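
The same breakdown can also be obtained numerically. A minimal base-R sketch, using a small made-up data frame `toy_cat` as a stand-in for `var_train_cat` (the category labels are illustrative):

```r
# Toy stand-in for var_train_cat (values are illustrative only)
toy_cat <- data.frame(
  Tenure.Status = c("Own", NA, "Rent", "Own", "Own"),
  Type.of.Roof  = c("Strong", "Light", NA, "Strong", NA),
  Region        = c("A", "B", "A", "C", "B"),
  stringsAsFactors = TRUE
)

na_count <- colSums(is.na(toy_cat))            # NA cells per column
na_share <- colMeans(is.na(toy_cat))           # proportion of NA per column
rows_with_na <- sum(!complete.cases(toy_cat))  # rows with at least one NA

# Columns ordered from most to least affected, as in the aggr output
sort(na_share, decreasing = TRUE)
```

When the total NA count exceeds the number of incomplete rows, as with 1253 versus 1250 above, some rows contain more than one NA.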
# Relative-frequency tables of the variables whose NA values will be imputed

table_pre_Tenure<-prop.table(table(var_train_cat$Tenure.Status))
table_pre_Worker<-prop.table(table(var_train_cat$Household.Head.Class.of.Worker))
table_pre_Walls<-prop.table(table(var_train_cat$Type.of.Walls))
table_pre_Roof<-prop.table(table(var_train_cat$Type.of.Roof))

# Summary of the 4 variables whose NA values will be imputed

summary_Tenure <- summary(var_train_cat$Tenure.Status)
summary_Worker <- summary(var_train_cat$Household.Head.Class.of.Worker)
summary_Walls<-summary(var_train_cat$Type.of.Walls)
summary_Roof<-summary(var_train_cat$Type.of.Roof)

When choosing an imputation method for the missing data, it helps to keep in mind that the affected variables are categorical and that the model to be designed is a multiple linear regression.

A good option is therefore the non-linear KNN (k nearest neighbors) method. It computes the distance from the observation with the missing value to every other observation, sorts those distances in ascending order, and keeps the k closest ones. For a categorical variable, the imputed value is the category that appears most frequently among those k nearest neighbors.
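
To make the idea concrete, here is a simplified base-R sketch of kNN imputation for one categorical column. It is not the `VIM::kNN` implementation (which, by default, uses a Gower-type distance with k = 5): the helper `impute_knn_categorical`, the simple-matching distance, and the toy data are all assumptions chosen for illustration, and the predictor columns are assumed to be complete.

```r
# Hypothetical helper: impute NA in one categorical column by majority vote
# among the k observed rows that differ in the fewest predictor columns.
impute_knn_categorical <- function(df, target, k = 3) {
  predictors <- setdiff(names(df), target)
  # Character matrix of predictors: one row per observation
  pred <- vapply(df[predictors], as.character, character(nrow(df)))
  values <- as.character(df[[target]])
  donors <- which(!is.na(values))  # rows with an observed target value
  for (i in which(is.na(values))) {
    # Simple-matching distance: number of predictor columns that differ
    current <- matrix(pred[i, ], nrow = length(donors),
                      ncol = ncol(pred), byrow = TRUE)
    distances <- rowSums(pred[donors, , drop = FALSE] != current)
    # Keep the k closest donors and take the most frequent category
    nearest <- donors[order(distances)[seq_len(min(k, length(donors)))]]
    votes <- table(values[nearest])
    values[i] <- names(votes)[which.max(votes)]
  }
  df[[target]] <- values  # note: returned as character, not factor
  df
}

# Toy data (illustrative only): the last row has a missing Tenure.Status
toy <- data.frame(
  Region = c("A", "A", "A", "B", "B", "B", "A"),
  Type.of.Household = c("Single", "Single", "Extended", "Extended",
                        "Extended", "Single", "Single"),
  Tenure.Status = c("Own", "Own", "Rent", "Rent", "Rent", "Own", NA)
)

imputed <- impute_knn_categorical(toy, "Tenure.Status", k = 3)
imputed$Tenure.Status[7]
```

In the toy data the two closest rows match the incomplete row exactly and both are "Own", so the majority vote among the three nearest neighbors selects that category.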

# Imputation of the NA values with the non-linear kNN (k nearest neighbors) method

var_train_cat <- VIM::kNN(var_train_cat,variable='Tenure.Status',impNA=TRUE)
var_train_cat$Tenure.Status_imp<-NULL
var_train_cat <- VIM::kNN(var_train_cat,variable='Household.Head.Class.of.Worker',impNA=TRUE)
var_train_cat$Household.Head.Class.of.Worker_imp<-NULL
var_train_cat <- VIM::kNN(var_train_cat,variable='Type.of.Walls',impNA=TRUE)
var_train_cat$Type.of.Walls_imp<-NULL
var_train_cat <- VIM::kNN(var_train_cat,variable='Type.of.Roof',impNA=TRUE)
var_train_cat$Type.of.Roof_imp<-NULL


# Check that no NA values remain in the categorical variables

sum(is.na(var_train_cat))
## [1] 0
# Relative-frequency tables after imputing the NA values with kNN

table_pos_Tenure<-prop.table(table(var_train_cat$Tenure.Status))
table_pos_Worker<-prop.table(table(var_train_cat$Household.Head.Class.of.Worker))
table_pos_Walls<-prop.table(table(var_train_cat$Type.of.Walls))
table_pos_Roof<-prop.table(table(var_train_cat$Type.of.Roof))

Finally, the proportions are checked to confirm that the imputation has not affected them.

# Check that the proportions were not affected by the imputation (differences in percentage points)

porc_dif_Tenure <- (table_pos_Tenure*100)-(table_pre_Tenure*100)
porc_dif_Worker <- (table_pos_Worker*100)-(table_pre_Worker*100)
porc_dif_Walls <- (table_pos_Walls*100)-(table_pre_Walls*100)
porc_dif_Roof <- (table_pos_Roof*100)-(table_pre_Roof*100)
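
The differences computed above are in percentage points, so a value close to zero for every category means the imputation preserved the original distribution. A small sketch with made-up vectors (`before` and `after` are hypothetical, as are the category labels) shows how the largest shift could be summarized in a single number:

```r
# Toy data: one categorical variable before and after imputation.
# Explicit, identical levels keep the two tables aligned for subtraction.
before <- factor(c("Own", "Own", "Rent", NA, "Own", "Rent", NA, "Own"),
                 levels = c("Own", "Rent"))
after <- factor(c("Own", "Own", "Rent", "Own", "Own", "Rent", "Rent", "Own"),
                levels = c("Own", "Rent"))

# table() ignores NA by default, so 'before' is based on the 6 observed values
prop_before <- prop.table(table(before))
prop_after <- prop.table(table(after))

# Difference in percentage points per category, and the largest absolute shift
diff_points <- (prop_after - prop_before) * 100
max_shift <- max(abs(diff_points))
max_shift
```

The subtraction only lines up category by category when both tables share the same levels in the same order, which is why the factors are built with explicit `levels`.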

5.2 Variable transformation

For a multiple linear regression model, it is highly desirable that the following conditions hold:

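  • The variables should be as close to normally distributed as possible (strictly, the normality assumption concerns the model residuals, but strongly skewed variables tend to make it harder to meet).
  • The predictor variables should not be highly correlated with one another.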
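
The second condition can be screened with a correlation matrix. The sketch below, on simulated data, lists the pairs of columns whose absolute Pearson correlation exceeds a threshold. The helper name `high_correlation_pairs`, the 0.8 cut-off, and the simulated columns are assumptions for illustration only.

```r
# Hypothetical helper: pairs of columns with |correlation| above a threshold
high_correlation_pairs <- function(df, threshold = 0.8) {
  corr <- cor(df, use = "pairwise.complete.obs")
  # Keep each pair once by looking only at the upper triangle
  idx <- which(abs(corr) > threshold & upper.tri(corr), arr.ind = TRUE)
  data.frame(var1 = rownames(corr)[idx[, "row"]],
             var2 = colnames(corr)[idx[, "col"]],
             correlation = corr[idx],
             stringsAsFactors = FALSE)
}

# Simulated data (not survey values)
set.seed(1)
food <- rnorm(200, mean = 100, sd = 20)
toy <- data.frame(
  Total.Food.Expenditure = food,
  Rice.Expenditure = 0.4 * food + rnorm(200, sd = 2),  # built to track 'food'
  House.Age = rnorm(200, mean = 20, sd = 10)           # independent of the rest
)

pairs_found <- high_correlation_pairs(toy)
pairs_found
```

Only the pair that was constructed to move together is reported; for such a pair, one of the two variables would typically be dropped or the two combined before fitting the model.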

To apply a multiple regression model to the numeric variables in this work, a transformation is therefore needed to bring them as close as possible to a normal distribution. Recall that the EDA showed most variables to be right-skewed, which could harm the model design (non-normal distributions).
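
One point worth verifying at this step: standardizing (subtracting the mean and dividing by the standard deviation) is a linear rescaling, so it leaves the shape of a distribution, and therefore its skewness, unchanged. The sketch below illustrates this on simulated right-skewed data and contrasts it with a `log1p` transform. The log transform and the helper `sample_skewness` are assumptions introduced for illustration, not steps from the original analysis.

```r
# Hypothetical helper: moment-based sample skewness (third standardized moment)
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^(3 / 2)
}

# Simulated right-skewed values (not survey data)
set.seed(42)
expenditure <- rlnorm(1000, meanlog = 8, sdlog = 1)

standardized <- as.numeric(scale(expenditure))  # mean 0, standard deviation 1
log_values <- log1p(expenditure)                # log(1 + x), defined at 0

skew_raw <- sample_skewness(expenditure)
skew_standardized <- sample_skewness(standardized)
skew_log <- sample_skewness(log_values)

c(raw = skew_raw, standardized = skew_standardized, log1p = skew_log)
```

The raw and standardized skewness coincide, while the log-transformed values are close to symmetric. This is consistent with the standardized summaries in this section, where several maxima still lie more than 10 standard deviations above the mean.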

# Standardization of the numeric variables with scale (mean 0, standard deviation 1)

var_train_num_NORM <- scale(var_train_num, center = TRUE, scale = TRUE)


# Histograms of the standardized quantitative variables

summary(var_train_num_NORM[1:7000,1])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.8689 -0.5357 -0.3107  0.0000  0.1769 17.7158
qplot(var_train_num_NORM[1:7000,1],
      geom="histogram",
      main = "Histogram for Total Household Income", 
      xlab = "Total Household Income",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,2])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.5470 -0.6660 -0.2381  0.0000  0.4079 12.3802
qplot(var_train_num_NORM[1:7000,2],
      geom="histogram",
      main = "Histogram for Total Food Expenditure", 
      xlab = "Total Food Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,3])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.9653 -0.6541 -0.1402  0.0000  0.4895 25.2303
qplot(var_train_num_NORM[1:7000,3],
      geom="histogram",
      main = "Histogram for Bread.and.Cereals.Expenditure", 
      xlab = "Bread.and.Cereals.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,4])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.6333 -0.6442 -0.1397  0.0000  0.5339 29.5476
qplot(var_train_num_NORM[1:7000,4],
      geom="histogram",
      main = "Histogram for Total Rice Expenditure", 
      xlab = "Total Rice Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,5])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.9792 -0.6718 -0.2917  0.0000  0.3342 23.1227
qplot(var_train_num_NORM[1:7000,5],
      geom="histogram",
      main = "Histogram for Meat.Expenditure", 
      xlab = "Meat.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,6])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.3673 -0.6515 -0.2398  0.0000  0.3551  9.2801
qplot(var_train_num_NORM[1:7000,6],
      geom="histogram",
      main = "Histogram for Total.Fish.and..marine.products.Expenditure", 
      xlab = "Total.Fish.and..marine.products.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,7])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.9286 -0.5592 -0.2607  0.0000  0.2080 29.2193
qplot(var_train_num_NORM[1:7000,7],
      geom="histogram",
      main = "Histogram for Fruit.Expenditure", 
      xlab = "Fruit.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,8])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.5183 -0.6563 -0.2016  0.0000  0.3999 13.4106
qplot(var_train_num_NORM[1:7000,8],
      geom="histogram",
      main = "Histogram for Vegetables.Expenditure", 
      xlab = "Vegetables.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,9])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.6627 -0.5792 -0.3535  0.0000  0.2010 16.7806
qplot(var_train_num_NORM[1:7000,9],
      geom="histogram",
      main = "Histogram for Restaurant.and.hotels.Expenditure", 
      xlab = "Restaurant.and.hotels.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,10])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.50105 -0.50105 -0.37471  0.00000  0.09402 16.99393
qplot(var_train_num_NORM[1:7000,10],
      geom="histogram",
      main = "Histogram for Alcoholic.Beverages.Expenditure", 
      xlab = "Alcoholic.Beverages.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,11])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.5560 -0.5560 -0.5083  0.0000  0.2063 23.3245
qplot(var_train_num_NORM[1:7000,11],
      geom="histogram",
      main = "Histogram for Tobacco.Expenditure", 
      xlab = "Tobacco.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,12])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.7271 -0.5265 -0.3213  0.0000  0.1164 15.7642
qplot(var_train_num_NORM[1:7000,12],
      geom="histogram",
      main = "Histogram for Clothing..Footwear.and.Other.Wear.Expenditure", 
      xlab = "Clothing..Footwear.and.Other.Wear.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,13])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.6748 -0.4746 -0.2894  0.0000  0.1539 23.4559
qplot(var_train_num_NORM[1:7000,13],
      geom="histogram",
      main = "Histogram for Housing.and.water.Expenditure", 
      xlab = "Housing.and.water.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,14])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.53487 -0.38268 -0.26094  0.00000  0.07387 29.90183
qplot(var_train_num_NORM[1:7000,14],
      geom="histogram",
      main = "Histogram for Imputed.House.Rental.Value", 
      xlab = "Imputed.House.Rental.Value",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,15])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.26125 -0.25116 -0.22201  0.00000 -0.09712 23.74789
qplot(var_train_num_NORM[1:7000,15],
      geom="histogram",
      main = "Histogram for Medical.Care.Expenditure", 
      xlab = "Medical.Care.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,16])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.6650 -0.5292 -0.3170  0.0000  0.1119 12.7173
qplot(var_train_num_NORM[1:7000,16],
      geom="histogram",
      main = "Histogram for Transportation.Expenditure", 
      xlab = "Transportation.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,17])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.57729 -0.49687 -0.35948  0.00000 -0.01915 11.65362
qplot(var_train_num_NORM[1:7000,17],
      geom="histogram",
      main = "Histogram for Communication.Expenditure", 
      xlab = "Communication.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,18])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.3836 -0.3836 -0.3364  0.0000 -0.1638 20.7404
qplot(var_train_num_NORM[1:7000,18],
      geom="histogram",
      main = "Histogram for Education.Expenditure", 
      xlab = "Education.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,19])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.76389 -0.53049 -0.34809  0.00000  0.09411 16.96726
qplot(var_train_num_NORM[1:7000,19],
      geom="histogram",
      main = "Histogram for Miscellaneous.Goods.and.Services.Expenditure", 
      xlab = "Miscellaneous.Goods.and.Services.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,20])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.38679 -0.38679 -0.27841  0.00000 -0.02553 24.17866
qplot(var_train_num_NORM[1:7000,20],
      geom="histogram",
      main = "Histogram for Special.Occasions.Expenditure", 
      xlab = "Special.Occasions.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,21])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2998 -0.2998 -0.2998  0.0000 -0.1579 38.0010
qplot(var_train_num_NORM[1:7000,21],
      geom="histogram",
      main = "Histogram for Crop.Farming.and.Gardening.expenses", 
      xlab = "Crop.Farming.and.Gardening.expenses",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,22])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.39734 -0.39734 -0.25812  0.00000  0.08615 34.77798
qplot(var_train_num_NORM[1:7000,22],
      geom="histogram",
      main = "Histogram for Total.Income.from.Entrepreneurial.Acitivites", 
      xlab = "Total.Income.from.Entrepreneurial.Acitivites",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,23])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.55382 -0.72277 -0.08895  0.00000  0.68573  3.29145
qplot(var_train_num_NORM[1:7000,23],
      geom="histogram",
      main = "Histogram for Household.Head.Age", 
      xlab = "Household.Head.Age",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,24])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.5950 -0.7207 -0.2836  0.0000  0.5907  6.7108
qplot(var_train_num_NORM[1:7000,24],
      geom="histogram",
      main = "Histogram for Total.Number.of.Family.members", 
      xlab = "Total.Number.of.Family.members",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,25])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.0818 -1.0818 -0.2370  0.0000  0.6078  5.6765
qplot(var_train_num_NORM[1:7000,25],
      geom="histogram",
      main = "Histogram for Total.number.of.family.members.employed", 
      xlab = "Total.number.of.family.members.employed",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,26])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.8974 -0.5430 -0.2772  0.0000  0.2545 16.6989
qplot(var_train_num_NORM[1:7000,26],
      geom="histogram",
      main = "Histogram for House.Floor.Area", 
      xlab = "House.Floor.Area",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,27])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.4055 -0.7074 -0.2187  0.0000  0.4794  9.0662
qplot(var_train_num_NORM[1:7000,27],
      geom="histogram",
      main = "Histogram for House.Age", 
      xlab = "House.Age",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,28])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.6218 -0.7170  0.1878  0.0000  0.1878  6.5214
qplot(var_train_num_NORM[1:7000,28],
      geom="histogram",
      main = "Histogram for Number.of.bedrooms", 
      xlab = "Number.of.bedrooms",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,29])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.3539  0.2165  0.2165  0.0000  0.2165  8.0686
qplot(var_train_num_NORM[1:7000,29],
      geom="histogram",
      main = "Histogram for Number.of.Television", 
      xlab = "Number.of.Television",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,30])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.8054 -0.8054 -0.8054  0.0000  0.9890  6.3722
qplot(var_train_num_NORM[1:7000,30],
      geom="histogram",
      main = "Histogram for Number.of.CD.VCD.DVD", 
      xlab = "Number.of.CD.VCD.DVD",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,31])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.4168 -0.4168 -0.4168  0.0000 -0.4168 12.3251
qplot(var_train_num_NORM[1:7000,31],
      geom="histogram",
      main = "Histogram for Number.of.Component.Stereo.set", 
      xlab = "Number.of.Component.Stereo.set",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,32])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -0.755  -0.755  -0.755   0.000   1.101   8.527
qplot(var_train_num_NORM[1:7000,32],
      geom="histogram",
      main = "Histogram for Number.of.Refrigerator.Freezer", 
      xlab = "Number.of.Refrigerator.Freezer",  
      fill=I("blue"), 
      col=I("red")) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,33])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.6782 -0.6782 -0.6782  0.0000  1.4013  5.5605
qplot(var_train_num_NORM[1:7000,33],
      geom="histogram",
      main = "Histogram for Number.of.Washing.Machine", 
      xlab = "Number.of.Washing.Machine",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,34])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2882 -0.2882 -0.2882  0.0000 -0.2882 10.8704
qplot(var_train_num_NORM[1:7000,34],
      geom="histogram",
      main = "Histogram for Number.of.Airconditioner", 
      xlab = "Number.of.Airconditioner",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,35])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -0.235  -0.235  -0.235   0.000  -0.235  14.452
qplot(var_train_num_NORM[1:7000,35],
      geom="histogram",
      main = "Histogram for Number.of.Car..Jeep..Van", 
      xlab = "Number.of.Car..Jeep..Van",  
      fill=I("blue"), 
      col=I("red")) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,36])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2171 -0.2171 -0.2171  0.0000 -0.2171 13.8570
qplot(var_train_num_NORM[1:7000,36],
      geom="histogram",
      main = "Histogram for Number.of.Landline.wireless.telephones", 
      xlab = "Number.of.Landline.wireless.telephones",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,37])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.24618 -0.60702  0.03214  0.00000  0.67130  5.14543
qplot(var_train_num_NORM[1:7000,37],
      geom="histogram",
      main = "Histogram for Number.of.Cellular.phone", 
      xlab = "Number.of.Cellular.phone",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,38])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.4435 -0.4435 -0.4435  0.0000 -0.4435  7.6520
qplot(var_train_num_NORM[1:7000,38],
      geom="histogram",
      main = "Histogram for Number.of.Personal.Computer", 
      xlab = "Number.of.Personal.Computer",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,39])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.3827 -0.3827 -0.3827  0.0000 -0.3827  8.1497
qplot(var_train_num_NORM[1:7000,39],
      geom="histogram",
      main = "Histogram for Number.of.Stove.with.Oven.Gas.Range", 
      xlab = "Number.of.Stove.with.Oven.Gas.Range",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,40])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.0927 -0.0927 -0.0927  0.0000 -0.0927 24.8646
qplot(var_train_num_NORM[1:7000,40],
      geom="histogram",
      main = "Histogram for Number.of.Motorized.Banca", 
      xlab = "Number.of.Motorized.Banca",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(var_train_num_NORM[1:7000,41])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.5392 -0.5392 -0.5392  0.0000  1.2472  8.3928
qplot(var_train_num_NORM[1:7000,41],
      geom="histogram",
      main = "Histogram for Number.of.Motorcycle.Tricycle", 
      xlab = "Number.of.Motorcycle.Tricycle",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Standardization only shifts and rescales each variable, so it cannot change the shape of its distribution: a few variables look approximately normal, but most remain clearly skewed. A different transformation is therefore applied next: the natural logarithm, which is what R's log() computes by default.
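The difference between the two transformations can be illustrated on simulated data. The sketch below is not part of the original analysis: the log-normal sample and the skewness() helper are assumptions introduced for illustration, and the standardization is assumed to be a z-score (as the zero means in the summaries above suggest).

```r
# Hypothetical right-skewed sample standing in for an expenditure variable.
set.seed(1)
x <- rlnorm(5000, meanlog = 10, sdlog = 1)

# Sample skewness (mean of the cubed z-scores); helper defined only for this sketch.
skewness <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

x_std <- as.numeric(scale(x))  # z-score standardization
x_log <- log(x)                # natural logarithm

# Skewness is unchanged by standardization and close to 0 after the log.
round(c(raw = skewness(x), standardized = skewness(x_std), log = skewness(x_log)), 3)
```

The skewness of the standardized sample equals that of the raw sample, whereas the log-transformed sample is close to symmetric.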

# Log transformation; values equal to 0 become -Inf (by definition of the logarithm)

var_train_num_Log <- log(var_train_num)

# Replace the -Inf values with 0 so they do not interfere with plotting and processing

Log_sin_inf <- replace(var_train_num_Log, var_train_num_Log == -Inf, 0)
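One side effect of replacing -Inf with 0 is that original values of 0 and 1 both end up as 0, since log(1) = 0. The sketch below compares that approach with log1p(), which computes log(1 + x). The toy vector is hypothetical, and log1p() is shown only as a possible alternative, not as what this report uses.

```r
# Hypothetical counts with zeros, standing in for a column such as Number.of.Television.
counts <- c(0, 1, 2, 5)

# Approach used in this report: log() and then replace -Inf with 0.
log_imputed <- log(counts)
log_imputed[log_imputed == -Inf] <- 0

# Alternative (an assumption, not what the report does): log(1 + x).
log_shifted <- log1p(counts)

# Side-by-side comparison: the first two rows collide under log_imputed only.
data.frame(counts, log_imputed, log_shifted)
```

With log1p() every distinct count keeps a distinct transformed value, at the cost of a slightly different scale for small values.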

# Histograms of the log-transformed quantitative variables


summary(Log_sin_inf[1:7000,1])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   9.789  11.572  12.018  12.097  12.595  15.413
qplot(Log_sin_inf[1:7000,1],
      geom="histogram",
      main = "Histogram for Total Household Income", 
      xlab = "Total Household Income",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,2])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.744  10.848  11.203  11.200  11.576  13.487
qplot(Log_sin_inf[1:7000,2],
      geom="histogram",
      main = "Histogram for Total Food Expenditure", 
      xlab = "Total Food Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,3])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   9.721  10.052   9.997  10.348  12.753
qplot(Log_sin_inf[1:7000,3],
      geom="histogram",
      main = "Histogram for Bread.and.Cereals.Expenditure", 
      xlab = "Bread.and.Cereals.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,4])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   9.297   9.709   9.447  10.082  12.748
qplot(Log_sin_inf[1:7000,4],
      geom="histogram",
      main = "Histogram for Total Rice Expenditure", 
      xlab = "Total Rice Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,5])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   8.112   8.917   8.718   9.565  12.474
qplot(Log_sin_inf[1:7000,5],
      geom="histogram",
      main = "Histogram for Meat.Expenditure", 
      xlab = "Meat.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,6])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   8.611   9.065   8.996   9.489  11.311
qplot(Log_sin_inf[1:7000,6],
      geom="histogram",
      main = "Histogram for Total.Fish.and..marine.products.Expenditure", 
      xlab = "Total.Fish.and..marine.products.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,7])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   6.920   7.512   7.455   8.044  11.322
qplot(Log_sin_inf[1:7000,7],
      geom="histogram",
      main = "Histogram for Fruit.Expenditure", 
      xlab = "Fruit.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,8])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   7.964   8.388   8.297   8.764  10.816
qplot(Log_sin_inf[1:7000,8],
      geom="histogram",
      main = "Histogram for Vegetables.Expenditure", 
      xlab = "Vegetables.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,9])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   7.611   8.920   8.124   9.947  12.953
qplot(Log_sin_inf[1:7000,9],
      geom="histogram",
      main = "Histogram for Restaurant.and.hotels.Expenditure", 
      xlab = "Restaurant.and.hotels.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,10])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   5.620   4.103   7.170  10.551
qplot(Log_sin_inf[1:7000,10],
      geom="histogram",
      main = "Histogram for Alcoholic.Beverages.Expenditure", 
      xlab = "Alcoholic.Beverages.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,11])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   5.273   4.044   8.046  11.490
qplot(Log_sin_inf[1:7000,11],
      geom="histogram",
      main = "Histogram for Tobacco.Expenditure", 
      xlab = "Tobacco.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,12])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   7.224   7.929   7.848   8.661  11.634
qplot(Log_sin_inf[1:7000,12],
      geom="histogram",
      main = "Histogram for Clothing..Footwear.and.Other.Wear.Expenditure", 
      xlab = "Clothing..Footwear.and.Other.Wear.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,13])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.745   9.483  10.050  10.139  10.761  14.084
qplot(Log_sin_inf[1:7000,13],
      geom="histogram",
      main = "Histogram for Housing.and.water.Expenditure", 
      xlab = "Housing.and.water.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,14])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   8.700   9.287   8.979  10.086  13.998
qplot(Log_sin_inf[1:7000,14],
      geom="histogram",
      main = "Histogram for Imputed.House.Rental.Value", 
      xlab = "Imputed.House.Rental.Value",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,15])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.645   7.002   6.898   8.433  13.419
qplot(Log_sin_inf[1:7000,15],
      geom="histogram",
      main = "Histogram for Medical.Care.Expenditure", 
      xlab = "Medical.Care.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,16])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   7.797   8.739   8.605   9.542  12.388
qplot(Log_sin_inf[1:7000,16],
      geom="histogram",
      main = "Histogram for Transportation.Expenditure", 
      xlab = "Transportation.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,17])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   6.356   7.352   6.842   8.293  11.381
qplot(Log_sin_inf[1:7000,17],
      geom="histogram",
      main = "Histogram for Communication.Expenditure", 
      xlab = "Communication.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,18])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   6.785   5.505   8.324  12.889
qplot(Log_sin_inf[1:7000,18],
      geom="histogram",
      main = "Histogram for Education.Expenditure", 
      xlab = "Education.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,19])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.890   8.259   8.834   8.917   9.558  12.585
qplot(Log_sin_inf[1:7000,19],
      geom="histogram",
      main = "Histogram for Miscellaneous.Goods.and.Services.Expenditure", 
      xlab = "Miscellaneous.Goods.and.Services.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,20])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   7.313   5.590   8.517  12.737
qplot(Log_sin_inf[1:7000,20],
      geom="histogram",
      main = "Histogram for Special.Occasions.Expenditure", 
      xlab = "Special.Occasions.Expenditure",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,21])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   2.965   8.794  14.392
qplot(Log_sin_inf[1:7000,21],
      geom="histogram",
      main = "Histogram for Crop.Farming.and.Gardening.expenses", 
      xlab = "Crop.Farming.and.Gardening.expenses",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,22])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   9.852   6.878  11.097  15.384
qplot(Log_sin_inf[1:7000,22],
      geom="histogram",
      main = "Histogram for Total.Income.from.Entrepreneurial.Acitivites", 
      xlab = "Total.Income.from.Entrepreneurial.Acitivites",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,23])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.708   3.714   3.912   3.896   4.111   4.585
qplot(Log_sin_inf[1:7000,23],
      geom="histogram",
      main = "Histogram for Household.Head.Age", 
      xlab = "Household.Head.Age",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,24])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.099   1.386   1.401   1.792   2.996
qplot(Log_sin_inf[1:7000,24],
      geom="histogram",
      main = "Histogram for Total.Number.of.Family.members", 
      xlab = "Total.Number.of.Family.members",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,25])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3185  0.6931  2.0794
qplot(Log_sin_inf[1:7000,25],
      geom="histogram",
      main = "Histogram for Total.number.of.family.members.employed", 
      xlab = "Total.number.of.family.members.employed",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,26])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.609   3.219   3.689   3.718   4.248   6.906
qplot(Log_sin_inf[1:7000,26],
      geom="histogram",
      main = "Histogram for House.Floor.Area", 
      xlab = "House.Floor.Area",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,27])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.303   2.833   2.696   3.296   5.011
qplot(Log_sin_inf[1:7000,27],
      geom="histogram",
      main = "Histogram for House.Age", 
      xlab = "House.Age",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,28])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.6931  0.5125  0.6931  2.1972
qplot(Log_sin_inf[1:7000,28],
      geom="histogram",
      main = "Histogram for Number.of.bedrooms", 
      xlab = "Number.of.bedrooms",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,29])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06998 0.00000 1.79176
qplot(Log_sin_inf[1:7000,29],
      geom="histogram",
      main = "Histogram for Number.of.Television", 
      xlab = "Number.of.Television",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,30])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01724 0.00000 1.38629
qplot(Log_sin_inf[1:7000,30],
      geom="histogram",
      main = "Histogram for Number.of.CD.VCD.DVD", 
      xlab = "Number.of.CD.VCD.DVD",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,31])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.004224 0.000000 1.609438
qplot(Log_sin_inf[1:7000,31],
      geom="histogram",
      main = "Histogram for Number.of.Component.Stereo.set", 
      xlab = "Number.of.Component.Stereo.set",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,32])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0113  0.0000  1.6094
qplot(Log_sin_inf[1:7000,32],
      geom="histogram",
      main = "Histogram for Number.of.Refrigerator.Freezer", 
      xlab = "Number.of.Refrigerator.Freezer",  
      fill=I("blue"), 
      col=I("red")) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,33])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.00354 0.00000 1.09861
qplot(Log_sin_inf[1:7000,33],
      geom="histogram",
      main = "Histogram for Number.of.Washing.Machine", 
      xlab = "Number.of.Washing.Machine",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,34])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01779 0.00000 1.60944
qplot(Log_sin_inf[1:7000,34],
      geom="histogram",
      main = "Histogram for Number.of.Airconditioner", 
      xlab = "Number.of.Airconditioner",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,35])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.009669 0.000000 1.609438
qplot(Log_sin_inf[1:7000,35],
      geom="histogram",
      main = "Histogram for Number.of.Car..Jeep..Van", 
      xlab = "Number.of.Car..Jeep..Van",  
      fill=I("blue"), 
      col=I("red")) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,36])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.004284 0.000000 1.386294
qplot(Log_sin_inf[1:7000,36],
      geom="histogram",
      main = "Histogram for Number.of.Landline.wireless.telephones", 
      xlab = "Number.of.Landline.wireless.telephones",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,37])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.6931  0.5659  1.0986  2.3026
qplot(Log_sin_inf[1:7000,37],
      geom="histogram",
      main = "Histogram for Number.of.Cellular.phone", 
      xlab = "Number.of.Cellular.phone",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,38])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06562 0.00000 1.79176
qplot(Log_sin_inf[1:7000,38],
      geom="histogram",
      main = "Histogram for Number.of.Personal.Computer", 
      xlab = "Number.of.Personal.Computer",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,39])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.002335 0.000000 1.098612
qplot(Log_sin_inf[1:7000,39],
      geom="histogram",
      main = "Histogram for Number.of.Stove.with.Oven.Gas.Range", 
      xlab = "Number.of.Stove.with.Oven.Gas.Range",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,40])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.000908 0.000000 1.098612
qplot(Log_sin_inf[1:7000,40],
      geom="histogram",
      main = "Histogram for Number.of.Motorized.Banca", 
      xlab = "Number.of.Motorized.Banca",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(Log_sin_inf[1:7000,41])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.02656 0.00000 1.60944
qplot(Log_sin_inf[1:7000,41],
      geom="histogram",
      main = "Histogram for Number.of.Motorcycle.Tricycle", 
      xlab = "Number.of.Motorcycle.Tricycle",  
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
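The 41 summary-plus-histogram blocks above all follow the same pattern, so they could be produced by a single loop over the columns. The following is a minimal sketch under stated assumptions: it uses base R's hist() instead of qplot() so that it runs without extra packages, the toy data frame stands in for Log_sin_inf, and plot_all_histograms() is a hypothetical helper.

```r
# Hypothetical stand-in for the numeric training data (the real object is Log_sin_inf).
set.seed(42)
toy <- data.frame(
  Total.Household.Income = rnorm(200, mean = 12, sd = 0.7),
  Fruit.Expenditure      = rnorm(200, mean = 7.5, sd = 0.8)
)

# Draw one histogram per column and return the per-column summaries invisibly.
plot_all_histograms <- function(df, breaks = 30) {
  summaries <- lapply(names(df), function(col) {
    hist(df[[col]], breaks = breaks, col = "blue", border = "red",
         main = paste("Histogram for", col), xlab = col)
    summary(df[[col]])
  })
  names(summaries) <- names(df)
  invisible(summaries)
}

res <- plot_all_histograms(toy)
res$Fruit.Expenditure
```

Taking the title from the column name also avoids mismatches between the plotted column index and a hand-typed label.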

This transformation gives better results than standardization. Many variables acquire normal or nearly normal distributions, which makes them usable in the model design. Others, however, remain problematic because their distributions are still very asymmetric.

Some variables have a very high percentage of values equal to 0, which, once the -Inf values are replaced with 0, produces a spike at the left end of the distribution. This is attributed to the large number of Filipino families living in extreme poverty (although their economic situation has been improving considerably since 2018).
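The share of zeros can be quantified per variable before deciding what to discard. A minimal sketch, assuming a raw (untransformed) data frame such as var_train_num; the toy values below are hypothetical.

```r
# Hypothetical raw (untransformed) data; the real object would be var_train_num.
toy <- data.frame(
  Tobacco.Expenditure = c(0, 0, 0, 1200, 300, 0),
  Fruit.Expenditure   = c(900, 1500, 0, 2100, 700, 1300)
)

# Fraction of exact zeros per column, sorted from most to least zero-heavy.
zero_share <- sort(colMeans(toy == 0), decreasing = TRUE)
zero_share
```

Columns at the top of this ranking are the ones whose log-scale histograms show the spike at 0.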

In the next section, these findings are used to discard the variables whose distribution does not meet the requirements, starting from the log-transformed data set.

6. Correlation analysis

Recall the conditions that this analysis aims for before fitting a multiple linear regression model:

  • The variables should be approximately normally distributed and symmetric
  • The variables must not be highly correlated with one another

Therefore, in view of the previous section, the analysis starts from the log-transformed variables. Variables with clearly asymmetric distributions are discarded first (this is a working criterion of this report; the formal normality assumption of linear regression concerns the residuals, which are checked in section 8). Then, a pairwise correlation analysis identifies the variables that are highly correlated with one another, and these are rejected when designing the multiple linear regression model.

# ----- Discard variables whose distributions are not normal/symmetric -----

Log_reduced <- Log_sin_inf%>%select(
-Restaurant.and.hotels.Expenditure,
-Alcoholic.Beverages.Expenditure,    
-Tobacco.Expenditure,                          
-Imputed.House.Rental.Value,                 
-Medical.Care.Expenditure,                     
-Communication.Expenditure,                   
-Education.Expenditure,                        
-Total.number.of.family.members.employed,  
-Special.Occasions.Expenditure,              
-Crop.Farming.and.Gardening.expenses,
-Total.Income.from.Entrepreneurial.Acitivites,
-Number.of.bedrooms,                           
-Number.of.Television,                         
-Number.of.CD.VCD.DVD,                       
-Number.of.Component.Stereo.set,                
-Number.of.Refrigerator.Freezer,              
-Number.of.Washing.Machine,                    
-Number.of.Airconditioner,                     
-Number.of.Car..Jeep..Van,                      
-Number.of.Landline.wireless.telephones,       
-Number.of.Cellular.phone,                      
-Number.of.Personal.Computer,                 
-Number.of.Stove.with.Oven.Gas.Range,           
-Number.of.Motorized.Banca,                   
-Number.of.Motorcycle.Tricycle)

# Cross-correlation matrix

cor_matrix_log_reduced <- round(cor(Log_reduced), 4)

# ----- Heat map of the cross-correlation matrix -----

# X1 and X2 are the column names produced by reshape::melt(); reshape2::melt() names them Var1 and Var2
mapa_corr <- melt(cor_matrix_log_reduced)
ggplot(data = mapa_corr, aes(x = X1, y = X2, fill = value)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 60, vjust = 1, size = 6, hjust = 1),
        axis.text.y = element_text(vjust = 1, size = 5, hjust = 1))

7. Variable and model selection

Highly correlated variables should be avoided, so from each highly correlated pair the variable with the larger mean correlation with all the others is discarded. The variable to be predicted (“Total.Household.Income”) is left out of this analysis, since a high correlation with it is precisely what is wanted.
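The discarding rule can be sketched in a few lines of base R. This is a simplified illustration of the idea, not the implementation of caret's findCorrelation(); drop_correlated() and the toy data are hypothetical.

```r
# Simplified illustration of a correlation filter (not caret's exact implementation).
drop_correlated <- function(cor_mat, cutoff = 0.5) {
  keep <- colnames(cor_mat)
  repeat {
    m <- abs(cor_mat[keep, keep, drop = FALSE])
    diag(m) <- 0
    if (length(keep) < 2 || max(m) <= cutoff) break
    # Most correlated remaining pair (row and column positions within `keep`).
    pair <- which(m == max(m), arr.ind = TRUE)[1, ]
    # Mean absolute correlation of each member with the other remaining variables.
    avg <- rowSums(m[pair, , drop = FALSE]) / (length(keep) - 1)
    # Drop the member of the pair with the larger mean correlation.
    keep <- keep[-pair[which.max(avg)]]
  }
  keep
}

# Hypothetical data: b is nearly a copy of a, c is independent of both.
set.seed(7)
a <- rnorm(500)
toy <- data.frame(a = a, b = a + rnorm(500, sd = 0.1), c = rnorm(500))
kept <- drop_correlated(cor(toy))
kept
```

One of a and b is dropped and c is always kept, which mirrors the "flagging column" messages printed by the verbose output in this section.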

# ----- Variable selection ----- #

# Subset without the target variable (Total.Household.Income, first column)

sin_income_log <- Log_reduced[, -1]

# Discard highly correlated variables (findCorrelation)

index_log <- findCorrelation(cor(sin_income_log), cutoff = .5, verbose = TRUE, exact = TRUE)
## Compare row 1  and column  11 with corr  0.757 
##   Means:  0.488 vs 0.309 so flagging column 1 
## Compare row 11  and column  4 with corr  0.546 
##   Means:  0.399 vs 0.283 so flagging column 11 
## Compare row 4  and column  5 with corr  0.572 
##   Means:  0.375 vs 0.267 so flagging column 4 
## Compare row 2  and column  5 with corr  0.549 
##   Means:  0.342 vs 0.248 so flagging column 2 
## Compare row 9  and column  10 with corr  0.551 
##   Means:  0.314 vs 0.23 so flagging column 9 
## Compare row 5  and column  7 with corr  0.696 
##   Means:  0.293 vs 0.21 so flagging column 5 
## All correlations <= 0.5
sin_income_log <- sin_income_log %>% select(-all_of(index_log))


# Correlation matrix of the target together with the remaining variables

new_var_train_log <- cbind(Total.Household.Income = Log_reduced[, 1], sin_income_log)

cor_mat_log <- cor(new_var_train_log)

cor_mat_log <- cor_mat_log[, order(cor_mat_log[1, ], decreasing = TRUE)]

ggcorrplot(t(cor_mat_log), method = "circle") # Plot of the correlation matrix

# Variables whose correlation with Total.Household.Income exceeds 0.5 will be selected

Variables_ordenadas <- data.frame(t(cor_mat_log)[, 'Total.Household.Income']) # Keep only the sorted column of correlations with the target
colnames(Variables_ordenadas) <- 'Coef. Corr'
View(Variables_ordenadas)
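The 0.5 threshold can also be applied programmatically instead of reading the table by eye. A minimal sketch on hypothetical data (the real input would be new_var_train_log); reformulate() from the stats package builds the model formula from the selected names.

```r
# Hypothetical log-scale data; the real object would be new_var_train_log.
set.seed(3)
income <- rnorm(300, mean = 12, sd = 0.7)
toy <- data.frame(
  Total.Household.Income     = income,
  Transportation.Expenditure = 0.8 * income + rnorm(300, sd = 0.4),
  House.Age                  = rnorm(300, mean = 2.7, sd = 0.8)
)

# Correlation of every predictor with the target, sorted in decreasing order.
target <- "Total.Household.Income"
cors <- cor(toy)[target, ]
cors <- sort(cors[names(cors) != target], decreasing = TRUE)

# Keep predictors above the 0.5 threshold and build the model formula from them.
selected <- names(cors)[cors > 0.5]
model_formula <- reformulate(selected, response = target)
model_formula
```

The resulting formula can be passed directly to lm(), so the selected variables do not have to be typed by hand.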

8. Fitting, interpretation, and diagnosis of the multiple linear regression model

# Fit the multiple linear regression with the variables whose correlation with Total.Household.Income exceeds 0.5

RLM<-lm(Total.Household.Income~Transportation.Expenditure
          +Clothing..Footwear.and.Other.Wear.Expenditure
          +Fruit.Expenditure
          ,data=new_var_train_log)


# Standardized residuals of the model

residuos <- rstandard(RLM)


# Fitted values of the model

valores.ajustados <- fitted(RLM)

# Check that the residuals show no pattern against the fitted values

plot(valores.ajustados, residuos)

# Estimated coefficients (betas) of the multiple linear regression

summary(RLM)
## 
## Call:
## lm(formula = Total.Household.Income ~ Transportation.Expenditure + 
##     Clothing..Footwear.and.Other.Wear.Expenditure + Fruit.Expenditure, 
##     data = new_var_train_log)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4600 -0.3315 -0.0430  0.2857  4.1229 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                   7.460310   0.051031  146.19
## Transportation.Expenditure                    0.228790   0.004760   48.06
## Clothing..Footwear.and.Other.Wear.Expenditure 0.145010   0.005051   28.71
## Fruit.Expenditure                             0.205250   0.007118   28.84
##                                               Pr(>|t|)    
## (Intercept)                                     <2e-16 ***
## Transportation.Expenditure                      <2e-16 ***
## Clothing..Footwear.and.Other.Wear.Expenditure   <2e-16 ***
## Fruit.Expenditure                               <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4992 on 6996 degrees of freedom
## Multiple R-squared:  0.5792, Adjusted R-squared:  0.579 
## F-statistic:  3210 on 3 and 6996 DF,  p-value: < 2.2e-16
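Because both the response and the predictors are on the natural-log scale, each coefficient can be read approximately as an elasticity: the percentage change in predicted income associated with a 1 % change in that expenditure, holding the others fixed. The sketch below works this out for a 10 % increase using the coefficients printed above. It ignores the effect of the zeros that were replaced with 0, so it should be taken as a rough reading rather than an exact one.

```r
# Coefficients copied from the summary(RLM) output above.
betas <- c(
  Transportation.Expenditure                    = 0.228790,
  Clothing..Footwear.and.Other.Wear.Expenditure = 0.145010,
  Fruit.Expenditure                             = 0.205250
)

# Multiplicative change in predicted income when one predictor grows by 10 %
# and the others are held fixed: (1.10)^beta, expressed as a percentage.
pct_increase <- 0.10
income_change_pct <- 100 * ((1 + pct_increase)^betas - 1)
round(income_change_pct, 2)
```

For example, a 10 % higher transportation expenditure corresponds to a predicted income that is roughly 2.2 % higher.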

To check the result, the following plot shows whether the residuals follow a normal distribution. The Q-Q plot compares the theoretical quantiles of a normal distribution with the sample quantiles of the residuals: the closer the points lie to the line, the better.

qqnorm(residuos)

qqline(residuos)
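The visual impression from the Q-Q plot can be complemented with a number: the correlation between theoretical and sample quantiles. This is a sketch on simulated residuals, not a result from the model above; qq_correlation() is a hypothetical helper built on qqnorm() with plot.it = FALSE.

```r
# Hypothetical standardized residuals: one normal sample and one right-skewed sample.
set.seed(11)
res_normal <- rnorm(1000)
res_skewed <- as.numeric(scale(rexp(1000)))

# Correlation between theoretical and sample quantiles; values close to 1 mean
# the points lie close to a straight line in the Q-Q plot.
qq_correlation <- function(res) {
  qq <- qqnorm(res, plot.it = FALSE)
  cor(qq$x, qq$y)
}

c(normal = qq_correlation(res_normal), skewed = qq_correlation(res_skewed))
```

A value noticeably below 1 indicates curvature in the Q-Q plot, as would be expected from the long right tail suggested by the maximum residual in the summary.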

Model validation

# Select the model's predictors from the test set

pre_datos_testing<-datos_testing%>%select(Transportation.Expenditure
          ,Clothing..Footwear.and.Other.Wear.Expenditure
        ,Fruit.Expenditure)


# The test set must be log-transformed before validation, because the model was fitted on log-transformed training data

pre_datos_testing <- log(pre_datos_testing)


# Replace the -Inf values (from zeros) with 0, as was done for the training set

pre_datos_testing <- replace(pre_datos_testing, pre_datos_testing == -Inf, 0)


# Prediction with the fitted RLM model

ic <- predict(RLM, pre_datos_testing)


# Actual values (on the log scale) to compare against the predictions

Valores_reales <- log(datos_testing$Total.Household.Income)

Valores_predichos <- ic


# Residuals on the test set

residuos <- Valores_reales - Valores_predichos


# Check that the residuals show no pattern against the predicted values

plot(Valores_predichos, residuos)

As with the training data, the following Q-Q plot shows whether the test-set residuals follow a normal distribution by comparing the theoretical quantiles of a normal distribution with the sample quantiles: the closer the points lie to the line, the better.

qqnorm(residuos)

qqline(residuos)
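The plots above can be complemented with numeric error measures on the test set. The sketch below is an assumption about how such a summary might look, computed on simulated values; in the report, the inputs would be Valores_reales and Valores_predichos.

```r
# Hypothetical log-scale values standing in for Valores_reales and Valores_predichos.
set.seed(5)
actual    <- rnorm(500, mean = 12, sd = 0.75)
predicted <- actual + rnorm(500, sd = 0.5)

test_residuals <- actual - predicted

rmse <- sqrt(mean(test_residuals^2))   # root mean squared error (log scale)
mae  <- mean(abs(test_residuals))      # mean absolute error (log scale)
# Out-of-sample R^2: 1 minus residual sum of squares over total sum of squares.
r2   <- 1 - sum(test_residuals^2) / sum((actual - mean(actual))^2)

round(c(RMSE = rmse, MAE = mae, R2 = r2), 3)
```

An RMSE on the test set close to the residual standard error of the training fit (0.4992) would suggest that the model generalizes without obvious overfitting.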